PREFACE.


The present course of Lectures on the Measurement of
Groups and Series deals with some of the most modern
methods of statistical research. Interesting as they were to
those who had the advantage of hearing them delivered,
they will doubtless, when studied at leisure in printed form,
prove even more interesting and useful.

These Lectures are the fifth of a Series originated in 
1897, designed for the assistance of Actuarial Students in 
connection with matters not included in the official Text
Books. Three of the Series deal with legal matters, and 
one with the subject of Stock Exchange Securities. The 
present course carries the range of topics into the field of 
mathematics, and it is hoped that courses of lectures may 
be hereafter provided dealing with other subjects, practical 
and theoretical, relating to those branches of knowledge 
which it is the province of the Institute of Actuaries to 
promote and encourage. 


W. H. 




2Mh February, 1903. 
Note: In these lectures I have made free use of those
methods and [...] which (though in many cases of recent
origin) may now perhaps be regarded as common property; but I hope
that I have not inadvertently quoted without reference, or misquoted,
investigations or theories which may be regarded as personal to any of the
small body of statisticians working on the subject treated. The lectures
had to be prepared both for delivery and for the Press at short notice in the
midst of a busy session. This fact must be my apology for any obscureness,
unnecessary repetition, or clumsiness of arrangement or expression, which
may be found.


A. L. BOWLEY. 



MEASUREMENT OF GROUPS.


FIRST LECTURE. 


Gentlemen, it was with considerable diffidence that I [...]
to lecture to members of the Society of Actuaries [...]
subject with which they may be presumed to be so [...].
When I was asked if I could undertake these [...] I
had some difficulty in choosing a suitable subject, [...]
occurred to me that my audience were probably [...]
with the practical aspects of a question which I [...]
considering from the theoretical point of view, and [...]
would therefore be most suitable if I endeavoured to [...]
them some theoretical considerations on subjects [...]
not come in their ordinary course, but which were [...]
subjects which naturally come before them, and
allied to those subjects on which I have spent a [...]
amount of time and attention.

Groups.

The subject which I have selected is the measurement of
the characteristics of a group, and its representation. By a
group I understand a number of persons or things each of
which possesses a measurable characteristic, the [...]
being arranged according to the magnitude of the
characteristic. For example, if I have returns of the wages
of a number of people, and I group them according to
[...], saying how many are earning 20s. to 25s., and so on,
I shall have such a group; or if I choose a section of
the population and group them according to ages, I should
have another group of the kind I am thinking of. The
[...] I shall have to make about groups will, I hope, be
fairly general and apply to a very large range of groups.
[...] convenience of [...] I shall confine myself to
only a small number.

The particular group I am taking for discussion this
evening is taken from the current Census, the number of
married women in the county of York, on Census day,
grouped according to their ages. In selecting a group for
discussion it must be large enough and small enough. It
must be sufficiently large to conceal individual peculiarities
and the peculiarities of small sections; it must be sufficiently
small so as to be homogeneous. Both these limits are relative. A
group that is large enough for one purpose is too large for another
purpose; and a group that is homogeneous for one is
heterogeneous for another. The death rate of a whole country
may be sufficient for certain comparisons, but for other comparisons
you must subdivide according to districts and age. The [...]
that has been selected must be kept in view before [...]
arguments are based on the grouping and its measurement.

There are two main divisions of groups: those that are
derived from exact observations, and those which may be
regarded as samples of a larger group the whole of which has
not been measured. For purposes of reference I am calling
them Group α, when the observations are supposed to be
correct, as, for example, the number of persons who are in
receipt of a certain income; and Group β, where the numbers
are estimates; for example, an estimate of the number of
persons who may be expected to be in receipt of certain
incomes ten years hence, from an investigation of some [...]
now, or at some previous time. As regards Group α, our
chief work will be to select some method of abbreviating,
of describing in brief, each group; in the case of Group β
our work will be chiefly to criticise the correctness of the
statements, and to find methods which are properly
applicable for its correction if it is not exact, to measure its
precision, and then afterwards to select some suitable method
of abbreviating it.


The Graphic Method. 

The two chief methods of abbreviating or investigating
the characteristics of a group are the graphic method and
the method of averages. The method of averages should
perhaps be referred to first; but, since the use of diagrams in
explaining the meaning of averages is very considerable, I
have thought it better to take the method of diagrams first.
I have drawn out, in four different ways, the group already
named, the number of married women in the county of
York.

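Ages of Wives present with their Husbands in the Registration
County of York, 1901.

Between               No. per 1,000     Not more than     Per 1,000

15 and 16 years            .01          16 years old          .01
16 and 17 years            .03          17 years old          .04
17 and 18 years            .2           18 years old          .26
18 and 19 years           1.2           19 years old         1.5
19 and 20 years           3.4           20 years old         5
20 and 21 years           8             21 years old        13
21 and 25 years          75             25 years old        88
25 and 30 years         157             30 years old       245
30 and 35 years         162             35 years old       407
35 and 40 years         147             40 years old       554
40 and 45 years         125             45 years old       679
45 and 50 years         105             50 years old       784
50 and 55 years          80             55 years old       864
55 and 60 years          55             60 years old       919
60 and 65 years          40             65 years old       959
65 and 70 years          22             70 years old       981
70 and 75 years          14             75 years old       995
75 and 80 years           4             80 years old       999
Above 80 years            1                                1,000

(The classes from 15 to 20 years together give 5 per 1,000; those
from 20 to 25 years together give 83 per 1,000.)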



Total number included, 610,505. 


There are shown in Diagrams I to IV the numbers of married
women in that county per thousand between these ages. The
total of wives in the county of York living with their husbands
was 610,000 odd. As is usual, the numbers are divided in years
between the ages of 15 to 21, and after that in five-yearly
groups. The first method of representing figures by diagram
is to place a dot in a given vertical position for each person
or item in question. This is indicated in Diagram I. The
method is not very important and is perfectly obvious. I
should only use it as a means of passing to another, if it were
[Diagram I. Point Diagram. Vertical scale: per mille; horizontal scale: years.]


not that in those classes of measurement where the quantity is
separated by a finite interval it is incorrect to use the methods
shown in Diagrams II, III and IV. If one was entering
the number of houses at particular rents in a town, where it
might perhaps be supposed that the rent always jumped by as
much as £2, one could represent properly the number of
houses at each £2 mark, but there would be no house at the
intermediate intervals, and it would be incorrect to proceed
any further to such a curve as would lead one to suppose that
the quantity dealt with was continuous. To take another
example, the railway service from one town to another might
be represented by a series of dots placed vertically over the
time taken by the train, measured horizontally, but not by the
following methods. If, however, the quantity is capable of
continuous variation, such as age or height, or if by a slight
extension of the meaning it may be regarded as being capable
of continuous variation, such as income, we may proceed to
the method of Diagram II.

In Diagram II rectangles are drawn whose heights are
the same as the height of corresponding lines of dots in
Diagram I, but the breadth is the unit of abscissae, in this
case five years. The areas can be regarded as representing a
number of persons. The area of the whole space enclosed
by the outer lines of the rectangles is, on the scale chosen,
[...] inches, which represents the whole of the population
considered; the breadth of each rectangle* is [...] inch, and
[...] squared represents 1 per cent.

[Diagram II. Horizontal scale: years, 15 to 85.]

Before we can go any further we have to make some
assumption as to the distribution of persons within the five-year
intervals selected. Even in my class (α), when the
observations are known to be correct, some assumption must
be made as to distribution before proceeding further. If, for
example, the correct set of measurements of the heights of a
regiment was given, every soldier being measured correctly
to the nearest [...] of an inch, no correction would be required
for actual mistakes, but before a continuous curve could be
drawn passing from one [...] inch to the next, some assumption
must be made as to distribution of heights, e.g., that progression
was uniform between the given points. In the case
of observations (β) which are only samples, it is still
necessary to make some assumption as to continuity. To
consider what assumption is proper, in the case in
question, remember that the facts given exactly are the
numbers of persons whose ages lie between certain limits;
that is, we are given the area of the rectangle, or of the curve
which replaces the rectangle, on each unit of abscissae.
What we have to suppose is that the ages are subdivided not
merely into years, but into infinitesimal units of time; and
we have to make some assumption for guiding us in passing
from one of the given positions to the next. There are a few
positions which give definitely the number of persons below
20 years, below 25 years, and so forth. We have to find the
number of persons below 28, or any other assigned age. This
is a quite familiar idea; but there are one or two things in
connection with it which it is necessary to point out.

* The areas for ages below 25 are shown in more detail.

The Histogram and the Ogive.

If straight lines are drawn from the middle of each
horizontal line in Diagram II to the middle of the next, we
get the dotted line in Diagram III (called a histogram).

[Diagram III. Histogram.]

That is certain to be incorrect on two grounds. In the first
place, the area bounded by the lines nearest the highest point
is necessarily too small, for part of the area between the ages
30 and 35 is cut off by the dotted line, and nothing is placed
instead of it. Before pointing out the other way in which
the histogram is necessarily incorrect, we will pass on to
Diagram IV, which is thus constructed. At the ages shown
on the horizontal axis are drawn rectangles proportional to the
number of persons below that age; then we get a continually
ascending figure called an ogive, which is given as absolutely
correct in group α, the points at the corners of the steps
obtained in the figure being given by assumption. The
problem comes to be to draw some line or curve from these
fixed points that shall satisfy the conditions which we must
assign. Now, it would be necessarily wrong to join these
successive points by straight lines. If we take three corners,
A, B, C, not in a straight line, we get a sharp angle at B.
Introducing sharp angles there necessarily involves an error,
for they indicate discontinuity at certain arbitrary points,
which can correspond to no facts in nature. The angles
obtained in the histogram are erroneous for similar reasons.
If, as in group α, we are to suppose the observations to be
correct, a continuous line must be drawn through all the given
points which has no sharp angles in it, no sharp change
of curvature. If, as in group β, we are not bound to assume
that the observations are correct, the line may be drawn not
passing through the points, but near them. Many groups
may be represented with sufficient accuracy in rough work by
drawing a freehand curve passing through the given points;
it will be found that there is very little margin for drawing
such a curve, if the rule is made that the curvature is never
to be greater than necessary, that the direction is not changed
more rapidly than is necessary to pass through the points.
This condition, stated in mathematical language, supplies the
main problem of interpolation.
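
The fixed points of the ogive are simply running totals of the numbers per 1,000 in the successive age classes. As a minimal sketch in Python (the variable names are illustrative assumptions, and the classes below 25 years are merged into the two totals 5 and 83 given in the table), the corners of Diagram IV can be rebuilt as follows:

```python
from itertools import accumulate

# Upper age limit of each class and the number per 1,000 falling in it,
# taken from the table of wives in the county of York, 1901.
upper_ages = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80]
per_mille = [5, 83, 157, 162, 147, 125, 105, 80, 55, 40, 22, 14, 4]

# Ogive ordinates: number per 1,000 not more than each age.
ogive = list(accumulate(per_mille))

for age, below in zip(upper_ages, ogive):
    print(f"not more than {age} years: {below} per 1,000")
```

The totals so obtained agree with the last column of the table; the remaining 1 per 1,000 lies above 80 years.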

In group β we are not bound to assume that the curve
passes through all the points, and the question which is the
best curve (drawn freehand or otherwise) near the points,
needs the theory of probability for its discussion.

Interpolation Curve.

As regards group α, to which discussion may be confined
for the present, where the curve is to pass through all the
points, I suggest the familiar method of interpolation by a
parabolic formula. Take the equation $y = a_0 + a_1 x + a_2 x^2 + \dots$
continued to as many terms as are convenient. In the group
under discussion it would be inexpedient to take more than
4 or 5 terms, because we are fitting a definite algebraic curve
to irregular observations, and the law which underlies the
observations may very well change if we take a larger
period than 25 years. I have confined the work for this group
to 5 terms, continuing that series to $a_4 x^4$.

Consider what conditions that curve satisfies. Stopping at
the second term we have a straight line, which can only be
made to pass through two points. So we have to start afresh
at the second point, and thereby contradict the first assumption
which we make, that the increase is not subject to violent
changes, introducing an angle at any point. Introducing the
term $x^2$ we have a parabola, which can be made to pass
through three points; the curve has continuous curvature, the
third differential, and the third differences obtained from the
values of $y$ at three consecutive equidistant values of $x$,
vanish; that is to say, there is no sudden change in curvature.
The first differential measures the inclination of the line; the
second measures the change of inclination, and if that is
constant there is a constant change of inclination, but no
sudden break. But we have no reason for assuming a constant
change of inclination, and the curve which passes through
three assigned points will not in general pass through the next
point. We then proceed to include further terms. If we take
the equation up to $x^3$ we can introduce a point of inflection,
which we cannot do with the parabola. If we take the
equation a step further we introduce two points of inflection,
and it is unnecessary to go as far as the fifth term in a
diagram like this. If we take 6 terms, the 6th differential and
the 6th difference vanish, the 5th difference is constant, and
there is no sudden break, and so on.
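
The statement about vanishing differences can be tried on any parabola. The sketch below is an illustration only: the coefficients 3, 2 and 5 are hypothetical, chosen merely to have something to tabulate, and `differences` is an assumed helper name. It tabulates a parabola at equidistant values of x and takes successive differences.

```python
def differences(values, order):
    """Return the differences of the given order of an equidistant series."""
    row = list(values)
    for _ in range(order):
        row = [b - a for a, b in zip(row, row[1:])]
    return row

# A hypothetical parabola y = 3 + 2x + 5x^2 tabulated at x = 0, 1, ..., 5.
parabola = [3 + 2 * x + 5 * x**2 for x in range(6)]

second = differences(parabola, 2)  # constant: twice the coefficient of x^2
third = differences(parabola, 3)   # every entry is zero
print(second, third)
```

The same helper applied to a curve taken as far as the term in x^4 would give constant fourth differences and vanishing fifth differences.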

Now take the equation as far as the term $a_4 x^4$. We have
5 unknowns, and can determine them by assigning the
condition that the curve shall pass exactly through 5 points
on the diagram in question. I have calculated the equation
of the curve which will pass exactly through the 5 points of
the ogive, corresponding to 20, 25, 30, 35 and 40 years. The
method of calculation is a good deal facilitated by the use of
finite differences. Refer as origin for the abscissae to age 20,
take 5 years as unit, so that $x$ is 1 at age 25, and write down
the equations which naturally arise, taking the numbers from
the last column of the table on p. 3.

$$
\begin{aligned}
5 &= a_0 \\
88 &= a_0 + a_1 + a_2 + a_3 + a_4 \\
245 &= a_0 + a_1 \cdot 2 + a_2 \cdot 2^2 + a_3 \cdot 2^3 + a_4 \cdot 2^4 \\
407 &= a_0 + a_1 \cdot 3 + a_2 \cdot 3^2 + a_3 \cdot 3^3 + a_4 \cdot 3^4 \\
554 &= a_0 + a_1 \cdot 4 + a_2 \cdot 4^2 + a_3 \cdot 4^3 + a_4 \cdot 4^4
\end{aligned}
$$

It is easily shown that $a_4 = \Delta^4 \div 24$, $a_3 = \Delta^3 \div 6 - \Delta^4 \div 4$,
$a_2 = \Delta^2 \div 2 - \Delta^3 \div 2 + 11\Delta^4 \div 24$, and $a_1$ is then easily
calculated. In this case $\Delta^4 = 49$, $\Delta^3 = -69$, $\Delta^2 = 74$, $a_4 = 2\tfrac{1}{24}$,
$a_3 = -23\tfrac{3}{4}$, $a_2 = 93\tfrac{23}{24}$, $a_1 = 10\tfrac{3}{4}$.
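
The coefficients can be checked mechanically from the leading differences of the five ogive values. The following is a sketch only, using exact fractions; the function and variable names are assumptions. The expression used for a1 is not written out in the text: it is the one that follows from expanding Newton's forward-difference formula.

```python
from fractions import Fraction

def leading_differences(values):
    """Return the leading forward differences [D1, D2, ...] of a series."""
    leading = []
    row = list(values)
    while len(row) > 1:
        row = [b - a for a, b in zip(row, row[1:])]
        leading.append(row[0])
    return leading

# Ogive values at 20, 25, 30, 35 and 40 years (x = 0, 1, 2, 3, 4).
y = [5, 88, 245, 407, 554]
d1, d2, d3, d4 = (Fraction(d) for d in leading_differences(y))

a0 = Fraction(y[0])
a4 = d4 / 24
a3 = d3 / 6 - d4 / 4
a2 = d2 / 2 - d3 / 2 + 11 * d4 / 24
a1 = d1 - d2 / 2 + d3 / 3 - d4 / 4  # from Newton's formula (assumed form)

coeffs = [a0, a1, a2, a3, a4]

def ogive(x):
    """Value of a0 + a1*x + a2*x^2 + a3*x^3 + a4*x^4."""
    return sum(a * x**k for k, a in enumerate(coeffs))

print([str(a) for a in coeffs])
print([ogive(x) for x in range(5)])
```

Evaluating the fitted curve at x = 0 to 4 returns the five tabulated values, which confirms that the curve passes exactly through the chosen points.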

Let $z = f(x)$ be the equation of the curve replacing the
histogram, and $y = F(x)$ the equation of the ogive. Then
$y = \int z \, dx$, and $z = \dfrac{dy}{dx}$. By means of these equations, the
parabolic or smoothed curve now obtained from Diagram IV
can be used to furnish values to replace the histogram of
Diagram III. The same unit for $x$ (five years) must, of
course, be used in both cases. Thus, if $x = 1$ (25 years),

$$z = \frac{dy}{dx} = a_1 + 2a_2 \cdot 1 + 3a_3 \cdot 1^2 + 4a_4 \cdot 1^3 = 135.6.$$

At 30 years, $z = 166.9$; at 35 years, $153.8$.
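
The ordinates of the smoothed curve follow from differentiating the ogive term by term. A minimal sketch, assuming the coefficients found above (origin at age 20, unit five years); the function name `density` is an illustrative choice:

```python
from fractions import Fraction

# Ogive coefficients a0..a4: 5, 10 3/4, 93 23/24, -23 3/4, 2 1/24.
a = [Fraction(5), Fraction(43, 4), Fraction(2255, 24),
     Fraction(-95, 4), Fraction(49, 24)]

def density(x):
    """z = dy/dx: persons per 1,000 per five-year unit at abscissa x."""
    return sum(k * a[k] * x ** (k - 1) for k in range(1, len(a)))

for x in (1, 2, 3):
    print(20 + 5 * x, "years:", float(density(x)))
```

The exact values are 135 7/12 at 25 years, 166 11/12 at 30 years and 153 3/4 at 35 years.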

The curve obtained in this way is shown in the continuous
line in Diagram III; this curve satisfies the conditions that
the areas standing on the 5-year bases, from 20 years to
45 years, should represent on the chosen scale the number of
persons given by the original table, and that there should be
no abrupt changes of curvature.

Since the curve has been chosen so as to satisfy the
conditions for only five age periods, it will not necessarily
satisfy any more; but in this case the curve merges into
a straight line, which approximately fulfils the conditions till
65 years. If we need greater accuracy in later years, we
should calculate new values for the coefficients and obtain a second
curve, satisfying a new group of area conditions. If we
needed to draw the whole curve accurately, we should have
to devise a method of passing without a break of continuity
from one such parabolic curve to the next; but, as it is, we
only want means of obtaining specified points on the curve,
and that can be done by choosing the special parabolic curve
that is in the neighbourhood of the required points.



Averages.

The Mode. At the highest point of the smooth curve in
Diagram III, $\dfrac{dz}{dx} = 0$; hence $\dfrac{d^2y}{dx^2} = 0$ in the ogive for the same
value of $x$. Thus, as is otherwise evident, the ogive is
steepest, and there is a point of inflexion, at that value of $x$
which gives the greatest ordinate in Diagram III.

If $y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4$, we have
$0 = 2a_2 + 6a_3 x + 12a_4 x^2$, and

$$x = \left( -3a_3 \pm \sqrt{9a_3^2 - 24a_2 a_4} \right) \div 12a_4,$$

where $a_2$, $a_3$, $a_4$ are given in terms of the differences above.
Writing in these values, we have $x = 2.021$, which corresponds
to 30.10 years.
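
The arithmetic of the mode can be followed in a few lines. This is a sketch under the assumption that the coefficients are those of the curve through the ogive points at 20 to 40 years; of the two roots of the quadratic, the one at which the ordinate is a maximum (second derivative of z negative) is taken.

```python
import math

# Coefficients of x^2, x^3, x^4 in the ogive (origin age 20, unit 5 years).
a2, a3, a4 = 2255 / 24, -95 / 4, 49 / 24

# Roots of 0 = 2*a2 + 6*a3*x + 12*a4*x^2.
disc = 9 * a3**2 - 24 * a2 * a4
roots = [(-3 * a3 + sign * math.sqrt(disc)) / (12 * a4) for sign in (-1, 1)]

# At a maximum of z, d2z/dx2 = 6*a3 + 24*a4*x must be negative.
x_mode = next(x for x in roots if 6 * a3 + 24 * a4 * x < 0)
age_mode = 20 + 5 * x_mode
print(round(x_mode, 3), round(age_mode, 2))
```

The other root lies near x = 3.8 and corresponds to a minimum of the fitted curve, not to the mode.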

If we had included a further term, we should have
a cubic to solve to determine $x$. If we had only gone as
far as $x^3$ we should have the equation $0 = a_2 + 3a_3 x$; that
is, $x = -a_2 \div 3a_3$; but this formula is unsafe, unless the
fourth differences of the original figures are approximately
zero.

The equation taken as far as the $x^4$ term appears to me
to be practically the best in the example we are discussing.
If we choose the coefficients to satisfy the conditions starting
from the age 25, we obtain 30.48 years as the position of
the highest point. The discrepancy between this and the
30.10 years found from the parabola starting from the age
20 years, arises from the indeterminateness of the original
figures. It seems best to take the value from the former
curve, as the point then lies near the middle of the assigned
values. We adopt then the age 30.10 years as the required
age, and find that $z = 166.94$ at that age. The age so found
is called the mode of the group; it is also called the position
of greatest density and of the maximum ordinate.

The Median. The abscissa of the point (M) where
the ogive is cut by the horizontal line half way up the
scale (from 0 to 1,000) is called the median. In the
histogram, or the smoothed curve which replaces it, the
vertical through the median divides the curve into equal
areas. When the ogive is drawn, the median can at once
be found graphically. To find it algebraically, take a
parabolic equation as before, satisfied by five points lying
near the median, obtain the coefficients as before, and put
$y = 500$. One of the roots of the equation so obtained is the
median. Starting at 30 years we have

$$245 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 = 500,$$

the coefficients being those of the curve through the ogive
points from 30 years onwards, and solving by Horner's method,
$x = 1.623$, so that the median age is $30 + 5x = 38.11$ (years).
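
The algebraic route to the median can be sketched numerically. The code below is an illustration under stated assumptions: it fits the curve to the rounded per-1,000 ogive values at 30 to 50 years, and it uses simple bisection in place of Horner's method purely for brevity. Because the table is rounded, the root it finds need not agree with the quoted figure to the last decimal. The helper names are hypothetical.

```python
from fractions import Fraction

def newton_coefficients(y):
    """Coefficients a0..a4 of the quartic through five equidistant ordinates."""
    d = []
    row = list(y)
    while len(row) > 1:
        row = [q - p for p, q in zip(row, row[1:])]
        d.append(Fraction(row[0]))
    d1, d2, d3, d4 = d
    return [Fraction(y[0]),
            d1 - d2 / 2 + d3 / 3 - d4 / 4,
            d2 / 2 - d3 / 2 + 11 * d4 / 24,
            d3 / 6 - d4 / 4,
            d4 / 24]

def ogive(coeffs, x):
    """Evaluate the fitted ogive at abscissa x (floating point)."""
    return sum(float(c) * x**k for k, c in enumerate(coeffs))

def solve_for(level, coeffs, lo, hi, tol=1e-9):
    """Bisection for ogive(x) = level, the ogive assumed rising on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if ogive(coeffs, mid) < level:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Ogive values at 30, 35, 40, 45 and 50 years (x = 0, 1, 2, 3, 4).
coeffs = newton_coefficients([245, 407, 554, 679, 784])
x_median = solve_for(500, coeffs, 1.0, 2.0)
print("median age about", 30 + 5 * x_median)
```

The same `solve_for` call with the levels 250 and 750, and suitable brackets, gives the quartiles of the fitted curve.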

The quartiles are the abscissae of the points (Q1, Q2)
where the horizontal lines one-quarter and three-quarters up
the scale (from 0 to 1,000) cut the ogive. The verticals
through these abscissae in Diagram III would, together
with the median vertical, divide the area into four equal
parts. The quartiles can be found from one of the equations
already written by putting $y = 250$ and $750$ successively,
and solving for $x$; or they can be found graphically.

A rough method of finding these points, often sufficiently
accurate, and saving a more laborious solution, is to assume
that the parts of the ogive between the corners which
contain the median are straight lines. There are 407 (per
thousand) below 35 years; 93 out of the 35 to 40 year
group, which contains 147, are to be taken to reach the
median, which is on the hypothesis of a straight line
(35 + 93/147 of 5) years, that is 38.17 years, a value differing
little from that already obtained. The lower quartile by
either method is 30.16 years, the upper 48.6 years.
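
The rough straight-line method lends itself to a very short sketch. The function name `age_at` and the layout of the data are assumptions made for illustration; the figures are the corners of the ogive from the table.

```python
# Corners of the ogive: ages and numbers per 1,000 not more than that age.
ages = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80]
cumulative = [5, 88, 245, 407, 554, 679, 784, 864, 919, 959, 981, 995, 999]

def age_at(level):
    """Age below which `level` per 1,000 fall, by straight-line interpolation."""
    for i in range(1, len(ages)):
        if cumulative[i - 1] <= level <= cumulative[i]:
            share = (level - cumulative[i - 1]) / (cumulative[i] - cumulative[i - 1])
            return ages[i - 1] + share * (ages[i] - ages[i - 1])
    raise ValueError("level lies outside the tabulated range")

median = age_at(500)          # 35 + 93/147 of 5 years
lower_quartile = age_at(250)  # 30 + 5/162 of 5 years
print(median, lower_quartile)
```

With the rounded table this gives a median of about 38.16 years and a lower quartile of about 30.15 years, each within about a hundredth of a year of the figures in the text. Any other division of the group, such as the deciles, can be read off by passing the levels 100, 200, and so on to the same function.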

We have now the following figures:

Lower quartile          30.16 years.
Mode                    30.10 years.
Median                  38.11 years.
Arithmetic average      40.11 years.
Upper quartile          48.6  years.


The arithmetic average is calculated directly in the
ordinary way, but is of little importance in such a group
as this.

If a person is taken at random from this group, her
most probable age is 30.1 years, the mode. It is as likely as
not that she will be over 38.11 years, the median. It is as
likely as not that she will be between 30.16 and 48.6 years.
The chances are 3 to 1 against her being less than 30.16
years; 3 to 1 against her being over 48.6 years.

Other points can be obtained by dividing the group into
ten equal parts, or one hundred equal parts. These are called
the deciles and percentiles respectively.



The mode is of
special importance, as being the most probable value. It
is entirely unaffected by the extremes. If the Census
authorities had omitted all married women over 50 or all
under 20 years in their enumeration the mode would be still
in the same place. That is very important when we are
dealing with inaccurate figures. In those curves which have
a distinct mode, where the curve first tends upwards, reaches
a height, and then comes down again without ever pausing or
returning to a second height, and where there is a certain
symmetry or similarity of distribution on either side of it, in
such curves the mode is of special importance. If, on the
other hand, you have a regular mountain range represented
by your curve, the mode is probably of much less importance.
If you have a single peak it is probably of importance. But
though it is important in itself it is quite insufficient to
describe the curve; it only tells you the position of one point;
it does not tell you the steepness on either side, or the distance
from there to any assigned point.

The median is affected by extremes to some extent. If
the authorities had omitted all the married women over
50 the median would of course have been shifted, but not
very much, for the area which would have been left out
at the extreme right, when halved and distributed in the
neighbourhood of the median, would be found to have
caused only a very slight displacement of it. That can
be verified from Diagram IV. To take an example which
can be supplied by the diagram, suppose you omit all
those beyond the 800 per mille, which gives those above 55;
then the line through the 400 would give the median, which a
very rough measurement gives as 33 years. That is to say,
the median has only been shifted five years by leaving out
that immense number. If, instead of omitting these people
over 55, the Census authorities had simply said, "Here is a
married woman, obviously old, we do not know her age," and
had entered her in that category, it would not have affected
the median in the very least. The position of the extremes
does not affect the median, only the number of instances. In
the statistics with which I personally have to deal, often all
that is known is this number. In this respect the median
is very superior to the arithmetical average. The same applies
to quartiles. If we do not know the exact positions of the
extremes, [...] we know the quartiles. If you decide two
[...] and the median, you have three points on the ogive,
[...] positions in the histogram, from which the whole
[...] can often be constructed with fair accuracy.

The arithmetic average, or simply "the average," gives the
abscissa of the centre of gravity of the group when plotted out
as in Diagram III. The arithmetic average facilitates certain
[...], but, in my experience, it is the least valuable
of the means or averages which can be calculated; other
people's experience may be different. It is very liable to
error. If a part of the group is accidentally omitted the
average is at once affected. If the numbers are correct and
the positions not very far out, you would find by experiment
that the arithmetic average has not moved much; but directly
any numbers are left out, the arithmetic average is disturbed.
But the reason I distrust the arithmetic average and do not
advocate its use is chiefly because it renders such fallacious
arguments possible. If you are comparing one group with
another, after a little interval the arithmetic average may
have remained quite steady when the group has changed
considerably, both the extremes having come in towards the
mean; or it may shift when the group has not really changed
its character, but only shifted its position a little. Any
particular change of the arithmetic average may correspond
to an infinite number of different kinds of change in the
group; and it is very often pointed out that a certain group
has changed, that something has improved because the
arithmetic average has changed; whereas it is only shifting
the relative positions of two groups which are not connected
in reality.* If we have a perfectly homogeneous group, for
instance, if with wage statistics we deal with a set of men
doing similar work and earning similar wages, a change in the
arithmetic average is significant; but if we are dealing with a
composite group composed of skilled and unskilled workmen,
two homogeneous groups merged into one, the arithmetic
average might increase either by the higher group ascending
a little while the lower group went down nearly as far, or the
other way about; or by a combination of those two things.
* These two sentences apply also to the median, but the present unfamiliarity
of the term will suggest caution in using it; while, as a matter of fact, the
arithmetic average is used very carelessly.



So the arithmetic average can never give definite,
and very often gives fallacious, information. I have not time,
and perhaps it is not necessary, to dwell upon this point, and
refer to the correction factors for Urban death rates. The
necessity of that method illustrates my meaning in saying that
before an arithmetic average is used, it is necessary to make
sure that the group is homogeneous.

The quartiles and the median not only give the definite
position of the median, but also a measurement, which serves
to show how the curve is dispersed from its central position.
The distance between the two quartiles, 18.4 years in this case,
shows to some extent how the curve is dispersed from its
central point. That I shall return to in giving other
measurements of this dispersion.

If we were dealing with a group that did not give any
such regular figure as this, a group to which the mode was
certainly quite inapplicable, it would probably then be best not
to attempt to draw any continuous curve at all, but to keep to
such a diagram as that on page 5, and to calculate the deciles
as accurately as possible. By making some simple assumptions
as to continuity, it would be possible to calculate roughly the
nine deciles, dividing the area into 10 equal parts, and enter
them as a description of the group. I think that is the only
method of satisfactorily representing an irregular group which
cannot be divided into distinct homogeneous groups.

Comparison of Groups.

The ogive diagram lends itself more readily than any other
to the comparison of two groups. I have selected two
groups, which one might wish to compare, from the same
Census table, the husbands whose wives were between
45 and 50 years of age, and the wives whose husbands were
between 45 and 50, which are represented by the lines
LL and KK respectively; and I have calculated, by one
method or the other, the mode, the median and the quartiles of
those groups. Thus, for instance, from the curve K, of all the
wives whose husbands were between 45 and 50 years of age,
as many were less than 45.5 years as were more than that; and
similarly for the quartiles. The curves are very similar, the
husband curve being four years to the right of the other.
The method needs no further comment.



Diagram 



Illustration of Use of the Median.

I may take one example to illustrate the use of the
median. The diagram on p. 18 represents the weekly wages,
valuing everything that is paid in goods and not in money
at an appropriate rate, of three classes of labourers in
England, namely Artisans in Provincial Towns, such as
Birmingham; Agricultural Labourers, the average for the
whole of England; and Labourers in the same towns from
which the Artisans were selected. The figures are rather rough,
and there is no material for making them exact; but I think
the lines drawn represent with fair accuracy the course of
wages; for if we once established the fact that all agricultural
labourers are below the median, we have simply to count them
and not enquire about their wages. And so if we establish
the fact that any body of men is well above or well below the
median, we have not to enquire into their wages, but simply
to count them; and to find the median we have only to
investigate more carefully the body of men whose wages are
near the median; that is a comparatively easy task, because
the body of men who are near to it are those whom we see
any day in any ordinary industrial undertaking. The
Census figures are bad for this purpose in 1902, and
they were much worse in 1801; and there is a great
deal of computation and guess-work in determining the
position of the median at any time through the century.
But it can be done within certain limits of accuracy where
the task of determining the arithmetic average would
be hopeless. When we have determined the median and traced
out the positions for 110 years we have a much more
interesting and exact piece of information than if we had
made use of the arithmetic average. We have the wage of
that man who is half way up the skilled wage earners; but if
we give the arithmetic average it will carry us no further;
it is simply a numerical quotient. The line in the diagram is
drawn through the estimated positions of the median for all
male adult wage earners in the United Kingdom, at selected
dates. These figures are rough, and should not be quoted
without verification. The only ones calculated are those with
a dot or cross in the figure; intermediate lines are interpolated.
[Rough diagram illustrating the use of the Median.]


MEASUREMENT OF GROUPS.


SECOND LECTURE. 


The Standard Deviation and the Modulus.

The methods I have employed so far for determining the
median and the mode, together with the ordinary method of
determining the arithmetic average, together also with the
quartiles and deciles, give a series of definite quantities
connected with the curve. Each of these quantities, the
mode and the median, performs the function of an average;
that is to say, that number by itself gives briefly one of the
most important positions, one of the most important
characteristics of the whole curve. But no one of these
quantities gives sufficient information to enable us to
reconstruct the curve or to describe it completely. It is
true that if we have given the nine deciles, including the
median, we have nine points on a continuous curve, and in
general it is possible to construct it with reasonable accuracy.
But if we only have the mode, or only the median, we
have not enough to construct the curve. My object, then,
is to develop one or more methods of calculating other
quantities related to the group, which will enable us to
complete or amend the description of the group, as given
simply by one of the averages.

We will always suppose that the group is described in 
relation to a horizontal axis OX, and may be of any nature 
about the axis. What we have found so far in the median or the 
mode is one point on that group, one position on that axis: in 
the case of the mode, the position under the highest point; 
in the case of the median, the position the line through which 
divides the curve into two equal areas; in the case of the 
average, the abscissa of the centre of gravity. I have 
now to find a second quantity which will enable us to describe 
or determine the shape of the curve when you are given this 
one position on it. The method I am going to take is 
independent of any assumed shape of the curve, and it is 
applicable to both the groups to which I referred on page 2: 
the group which is supposed to be an accurate representation 
of the facts, and that which represents only samples of a 
larger group whose observation is not completely made. I 
have next to describe the well-known method of calculating 
the deviations from the average, and then to pass on to find 
the average deviation, the average square of the deviation, 
and the average cube of deviation. Let there be n observations 
represented by $x_1, x_2, \ldots, x_n$; let $\bar{x}$ be the abscissa 
of the centre of gravity; then $\bar{x}$ is the average of the group, 
the sum of the x's divided by n, their number. From each of the 
x's subtract the abscissa of the centre of gravity; thus 
$x_1 - \bar{x}$, $x_2 - \bar{x}$, . . . Those are the deviations of 
the observations from their average. In some connections they 
are called the errors from the average, but I shall adopt the 
word "deviation" in every case. In the first place it is to be 
noticed that the sum of the deviations is necessarily zero; for 

$$\sum (x - \bar{x}) = \sum x - n\bar{x} = 0.$$

The sum of the squares of the deviations is $\sum (x - \bar{x})^2$, and 
$\frac{1}{n}\sum (x - \bar{x})^2$ is the mean square of the deviations, which is 
otherwise called the second moment of the deviations, about 
the origin in this case. The word moment is from a 
dynamical analogy; it is used in this connection by Professor 
Karl Pearson. 

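The following notation is adopted. The moments 
measured about the origin are written $\mu_1', \mu_2', \mu_3', \ldots$, 
and those about the centre of gravity $\mu_1, \mu_2, \ldots$, so that 

$$\mu_1' = \frac{1}{n}\sum x = \bar{x}, \qquad \mu_2' = \frac{1}{n}\sum x^2, \qquad \mu_3' = \frac{1}{n}\sum x^3,$$

and 

$$\mu_2 = \frac{1}{n}\sum (x - \bar{x})^2 = \mu_2' - \mu_1'^2, \qquad \mu_3 = \frac{1}{n}\sum (x - \bar{x})^3 = \mu_3' - 3\mu_2'\mu_1' + 2\mu_1'^3.$$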

The quantities we need are not the moments about an 
arbitrary origin, but the moments about the centre of gravity. 
But it is far easier to calculate the moments about an 
arbitrary origin, and then to obtain those about the centre of 
gravity by the above formulae. 



Diagram VII. 

Heights of Persons. 

Observations and Curve of Error. 

 Inches  Instances                          |  Inches  Instances
(x + 57½)   (y)     xy     x²y       x³y    | (x + 57½)   (y)      xy      x²y        x³y
  57-58       1      0       0         0    |   67-68     321    3,210   32,100    321,000
  58-59       1      1       1         1    |   68-69     245    2,695   29,645    326,095
  59-60       6     12      24        48    |   69-70     213    2,556   30,672    368,064
  60-61       4     12      36       108    |   70-71     152    1,976   25,688    333,944
  61-62      15     60     240       960    |   71-72      88    1,232   17,248    241,472
  62-63      39    195     975     4,875    |   72-73      55      825   12,375    185,625
  63-64      74    444   2,664    15,984    |   73-74      26      416    6,656    106,496
  64-65     153  1,071   7,497    52,479    |   74-75       9      153    2,601     44,217
  65-66     243  1,944  15,552   124,416    |   75-76       1       18      324      5,832
  66-67     288  2,592  23,328   209,952    |   76-77       1       19      361      6,859
            824  6,331                      |
                                            |   Total   1,935   19,431  207,987  2,348,427

$\sum y = n = 1,935$. 
$\sum xy = 19,431$. 
$\mu_1' = \bar{x} = \sum xy / n = 10.0419$. Average 67.54 inches. 
$\sum x^2 y = 207,987$. 
$\mu_2' = \sum x^2 y / n = 107.487$. 
$\mu_3' = \sum x^3 y / n = 1,213.66$. 

Referred to average: 

$\mu_2' - \bar{x}^2 = 6.647$ (2nd moment), 
$\mu_2 = \mu_2' - \bar{x}^2 - \frac{1}{12} = 6.564$ (2nd moment corrected). 
$\sigma$ (standard deviation) $= \sqrt{\mu_2} = 2.562$ inches. 
$c$ (modulus) $= \sqrt{2\mu_2} = 3.623$ inches. 
$\mu_3 = \mu_3' - 3\mu_2'\bar{x} + 2\bar{x}^3 = 1,213.66 - 3,238.12 + 2,025.24 = .78$ (3rd moment). 
$j = \mu_3 / c^3 = .016$ (skewness from moments). 
$j = +.06$ (skewness from observations). 
$\eta$ (mean deviation from average) $= 2.01$ (inches). 
Median = average $- \frac{1}{3}jc$ = 67.47 (inches). 
Mode = average $- jc$ = 67.32 (inches). 

By parabolic interpolation: Mode, 67.303 inches. Median, 67.566 inches. 


It will now be convenient to follow the figures on the 
table and diagram adjoining. The figures are taken from the 
report of the Anthropometrical Committee of the British 
Association. They are selected merely as being a 
convenient group by which to explain the calculation of these 
moments. The heights of 1,935 persons were given as between 
certain inches, between 57 and 58 inches and so on, as under the 
column headed x + 57½. I take the origin at 57½ inches, and the 
abscissae for the successive groups are 0, 1, 2, 3 . . . 19. The 
number of instances in these various groups are those given 
in the second column, under the letter y; one person under 
58 inches, one between 58 and 59 inches, and so on. The 
instances in this case occur in groups, and we are not able to 
separate them by means of the data, hence each deviation will 
occur in most cases more than once. Thus, a deviation shown 
between 64 and 65 inches occurs 153 times. Instead of adding 
the x's simply to obtain the deviation we multiply each 
deviation by the number y, the number of times it occurs, and 
so obtain the third column xy, whose sum is 19,431. This sum 
of the deviations is to be divided by 1,935, the total number of 
deviations, to give the first moment about the origin, namely, 
10.042, and this gives the position of the centre of gravity 
measured from the origin, 57½ inches. The columns under x²y 
and x³y require no explanation. The totals 207,987 and 2,348,427 
are divided by n, giving 107.5 and 1,213, which are $\mu_2'$ and 
$\mu_3'$ in the notation adopted. 

It now remains to reduce these moments about the origin to 
the moments about the centre of gravity, by means of the 
formulae given above. 
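
The reduction may be checked by a short computation. The following is only a sketch in Python, not part of the original working: the frequencies are those of the table of heights, the variable names are chosen for illustration, and x is taken to run from 0 for the 57-58 inch group, as the xy column of the table implies.

```python
# Frequencies y for the twenty one-inch groups, 57-58 up to 76-77 inches.
y = [1, 1, 6, 4, 15, 39, 74, 153, 243, 288,
     321, 245, 213, 152, 88, 55, 26, 9, 1, 1]
x = range(len(y))   # abscissae measured from the origin at 57.5 inches

n = sum(y)                                          # total number of persons
m1 = sum(xi * yi for xi, yi in zip(x, y)) / n       # first moment about the origin
m2 = sum(xi**2 * yi for xi, yi in zip(x, y)) / n    # second moment about the origin
m3 = sum(xi**3 * yi for xi, yi in zip(x, y)) / n    # third moment about the origin

# Reduction to moments about the centre of gravity.
mu2 = m2 - m1**2
mu3 = m3 - 3 * m2 * m1 + 2 * m1**3

average_inches = 57.5 + m1
print(n, round(m1, 4), round(mu2, 3), round(mu3, 2), round(average_inches, 2))
```

The second moment found in this way is the uncorrected one; the correction for grouping is a separate step.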

The practical simplicity of evaluating the moments by this 
method arises from the fact that we are dealing in the x's 
with a series of numbers ascending in uniform order, 0 to 19, 
and that the whole arithmetic computation is very simple and 
very easily checked, whereas if we proceed on the direct 
method of writing down the position of the centre of gravity, 
which will naturally not be an exact number, each of the 
deviations will introduce as many decimal places as are kept 
in our calculation; and the squaring and cubing will be 
very arduous, and we have no ready means of checking our 
results. It is therefore worth while to take the formula and 
choose our origin so as to give the least arithmetic work and 
obtain the second and third moments indirectly. There is a 
small correction to be made for the moments so calculated for 
the second, fourth, and other even moments. I will deal only 
with the second. It is to be observed that in the whole 
calculation it is assumed that all the persons in a particular 
group are exactly at the middle of that group, e.g., that the 
153 persons in the 64 to 65 inches group have the height exactly 
64½ inches. It is obvious that that will not be the case, and 
it is easily seen that that will introduce a definite error in the 
calculation of the second moment. For if we take one of 
these groups in particular, and make the assumption that the 
whole number lies at its middle point, we are representing it 
by a rectangle instead of by a trapezium with the side nearer 
the centre of the group longer than the other; a little 
consideration will show that that makes the second moment 
too great. Mr. Sheppard has shown that under certain 
circumstances it will be sufficient correction to subtract the 
fraction $\frac{1}{12}$ from the second moment calculated on the 
assumption of uniform distribution at the middle points of 
the groups to obtain a moment in a true approximation. 
On page 21 the corrected moment is 6.564, while the 
uncorrected moment is 6.647. The correction will be 
$\frac{1}{12}$ only if the difference between successive groups is one 
unit of abscissa; if the difference was h, we should have 
to multiply $\frac{1}{12}$ by $h^2$; but for practical work it is best to 
take the unit as the distance between groups which we are 
dealing with, and hence the correction $\frac{1}{12}$ is in the form 
which is of practical use. 

The standard deviation is defined as the square root of 
the second moment about the centre of gravity. Professor 
Karl Pearson used $\sigma$ to denote it, and $\sigma$ is 2.562 inches in this 
case. It is sometimes more convenient to deal with the square 
root of twice the moment, which is called the modulus, and 
denoted by the letter c. Professor Edgeworth uses the modulus, 
whereas Professor Karl Pearson uses the deviation. We shall 
see the appropriateness of the modulus when we deal with the 
curve of error. The modulus for this group is 3.623 inches. 
It is a very remarkable fact that the modulus for the height 
of groups of men is almost universally very nearly 3.6 inches. 
Professor Edgeworth gives a list of 10 such groups in the 
Jubilee volume of the Journal of the Royal Statistical Society: 
the moduli are 3.6 (United Kingdom), 3.6 (England), 3.4 
(Scotland), 3.6, 3.7, 3.8 (United States), 3.7 (Belgium), 
. . . (Italy). I merely call attention to that in passing, to 
give an idea that the modulus is of real significance and 
not a mere arithmetical calculation. 
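
The figures quoted for this group may be verified as follows. This is a minimal sketch in Python, assuming the uncorrected second moment 6.647 from the table and a group width of one unit; the variable names are illustrative.

```python
from math import sqrt

mu2_uncorrected = 6.647   # second moment about the average, from the table
h = 1.0                   # width of each group, in units of the abscissa

mu2_corrected = mu2_uncorrected - h**2 / 12   # Sheppard's correction
sigma = sqrt(mu2_corrected)                   # standard deviation
c = sqrt(2 * mu2_corrected)                   # modulus

print(round(mu2_corrected, 3), round(sigma, 3), round(c, 3))
```

With groups of width h other than one unit, the same lines apply with h changed, the correction being h²/12.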

Average Deviation. 

For the next few minutes I propose to assume that the 
curve I am dealing with is symmetrical about its centre of 
gravity. The curve of heights which is sketched on page 21 
is in fact very nearly symmetrical. If the curve is actually 
symmetrical all the odd moments are easily seen to be zero, 
while the even moments are not. Then this quantity $\sigma$ or c, 
whichever we adopt, serves to measure the distance of the 
curve from its average, to use a clumsy phrase, or the 
dispersion about the average. Before discussing the 
appropriateness of this measurement I have to explain two 
simpler methods of measuring the same thing: one based on 
the first power of the deviations, and the other based on the 
distance between the quartiles. First for the average or mean 
deviation which, in the notation I am using, is called $\eta$. 
If we write down the deviations in the method just defined 
and add them up, we obtain zero; but if we treat all the 
deviations as positive and add up their absolute values we 
do not obtain zero. The calculation is as follows: treat the 
negative deviations and the positive deviations separately. 
The sum of the negative deviations is 

$$\sum y(10.0419 - x) = 10.0419 \times 824 - 6331,$$

from the figures to the left in the table on page 21. The 
sum of the positive deviations is 

$$\sum y(x - 10.0419) = 13100 - 10.0419 \times 1111,$$

from the numbers in the right compartment. The average 
deviation is, therefore, 

$$\{13100 - 10.0419(1111 - 824) - 6331\} \div 1935 = 2.01 \text{ (inches)}.$$

Probable Error. 

The other simple method is based on the quartiles. 
Calculate the quartiles of this group by any of the methods 
already given, and you will find them to be approximately 
65.8 inches and 69.3 inches. Since the median as given on 
page 21 is 67.566, one quartile is 1.78 inches below the 
median, and the other 1.72 inches above it. Half the distance 
between the quartiles is called the probable error. It is a 
term which is so firmly in use that there is no hope of 
improving it, but it is one of the most erroneous terms in use 
in mathematics. Half the distance is 1.75 inches. If we take 
a person at random from this group and measure his height, 
it is as likely as not the height will be found to be between 
the quartiles, for the space contained between the ordinates 
at the quartiles is exactly half the whole curve, hence the 
phrase "probable error." 

If we were dealing with the special distribution determined 
by the equation to the curve of error (see the Third Lecture) we 
should have the following relations: average deviation = modulus 
$\div \sqrt{\pi}$, and probable error = modulus $\times$ .4769. These relations 
are approximately true for this distribution of heights, for the 
values of the mean deviation and probable error found from 
these equations when the modulus is 3.623 inches are 2.04 and 
1.73 inches respectively, while the numbers found above are 
2.01 and 1.75. 
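
The two constants can be recovered numerically. The sketch below uses Python's standard `statistics.NormalDist`; it assumes only that the curve of error is the normal curve with standard deviation equal to the modulus divided by the square root of 2, and the variable names are illustrative.

```python
from math import pi, sqrt
from statistics import NormalDist

c = 3.623   # modulus of the heights, in inches

# Mean deviation of the curve of error in terms of the modulus.
mean_deviation = c / sqrt(pi)

# The upper quartile of the unit normal curve lies at about 0.6745 standard
# deviations; since sigma = c / sqrt(2), the factor on the modulus follows.
factor = NormalDist().inv_cdf(0.75) / sqrt(2)
probable_error = factor * c

print(round(factor, 4), round(mean_deviation, 2), round(probable_error, 2))
```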

These methods of describing groups are, however, 
applicable to groups which do not conform, even approximately, 
to the law of error. I shall now treat them without the 
assumption that they do conform. The probable error is the 
measure of dispersion which is most quickly calculated. We 
can write down the quartiles very rapidly, and take half their 
difference at once. But that only takes into account the 
positions of the two quartiles, and does not take into account 
the positions of the extremes, but only their size, and, 
depending as it does only on two quantities, is liable to a large 
amount of accidental error. The mean deviation on the other 
hand takes into account the position as well as the number of 
all the quantities, and is therefore less liable to accidental 
error, and also it does not take at all long to calculate with 
simple numbers. The modulus and the standard deviation, 
again, take into account every observation, but they give extra 
weight to those which are a great distance from the average. 
In some cases that is right; in others it is not. If we are 
basing arguments as to the group and the shape of the group 
on probability, then very likely it will be correct to give this 
extra weight to an object which is far from the average, for 
the farther from the average the less the probability, and in 
some cases the probability diminishes very rapidly as we move 
from the average. If we are not going to make assumptions 
about the shape of the curve, nor apply the principles of 
probability, I do not know that we shall find any justification 
for taking the square, rather than the mean, deviation. As a 
rule we may say that we pass appropriately from the 
probable error to the mean error, and from the mean error 
to the standard error as the curves with which we are dealing 
become more definite and perfectly continuous, and 
approximate more and more nearly to a curve with a definite 
algebraic equation. For very rough measurements which are 
not continuous and which are not to be corrected, the probable 
error, measured as half the distance between the quartiles, 
will very likely be the best measurement. As the curve 
attains a definite shape, and as we are able to treat the 
observations as more and more continuous, it will be well to 
take the mean error, and finally, if we have a perfect 
algebraic curve, then very likely it will be most correct to 
take the standard deviation. 

Measurement of Skewness. 

Now to pass on to unsymmetrical curves. We have 
obtained by one of the averages the position of the curve and 
by one of these measures of dispersion one measure of its 
shape. We shall now obtain the measure of its want of 
symmetry, or briefly, of its skewness. Most curves have some 
degree of skewness; but in some cases it is negligible. 

As an example of a curve with considerable skewness, we 
may take Diagram III, on p. 6. The curve is elongated to 
the right; the mode is to the left, the centre of gravity to the 
right of the median. This is the general order of these three 
averages. If a skew curve is formed by stretching a 
symmetrical curve to the right, the stretching shifts the centre 
of gravity, relatively to the median; or, from another point of 
view, if a curve is heaped up to the left and stretched to the 
right, experiment will show that the line through the median 
is to the right of the highest point. 

There are very many possible ways of measuring this 
skewness. One obvious measurement is simply the distance 
of the centre of gravity from the median. Another is to use 
the quartiles. Call the positions of the quartiles $Q_1$, $Q_2$, the 
position of the median O, of the mode M, and of the centre of 
gravity G. In a symmetrical curve the distance $Q_1O$ is equal 
to the distance $OQ_2$, whereas in a skew curve it will not be. 
In a skew curve stretching to the right, the upper quartile 
to the right is further from the median than the lower 
quartile, and the difference between these two measures will 
form another means of estimating its skewness. The third 
method is to take the first power of the deviations, and compare 
the excess on one side the centre of gravity with the defect 
on the other. The fourth method is to take the third power 
of the deviations and consider its absolute magnitude. All 
these methods have their uses. I propose to deal with three 
of them. I will first take that which is arithmetically the 
simplest. The simplest measurement, that which you can 
calculate almost instantly, is the difference between the 
distances from the quartiles to the median. But that gives a 
concrete quantity; in the case before us so many inches; 
whereas it is convenient to measure the skewness as an 
absolute quantity, on a scale from + 1 to - 1; and we must 
therefore reduce this concrete quantity to an absolute 
quantity. The proper method of doing that is to divide it by 
the modulus, which is a concrete quantity, in this case so 
many inches. The one divided by the other gives an absolute 
measurement, which would serve to measure the skewness. 
But it is better to multiply that measurement by the constant 
3.29 (see p. 36, below) before using it, to bring it into 
conformity with the theory of probability; in the same sort of 
way as the multiplication of the second moment by 2 to get 
the modulus brings the standard deviation into conformity 
with the methods of probability. 

Another and yet simpler method of measuring almost 
exactly the same quantity, is to divide the difference between 
those two quantities by their sum, that is to say by twice the 
probable error; then if we multiply that by 3.14 (see p. 36, 
below), we shall obtain the same measurement very nearly 
as before. This method supplies a good rough measurement 
which is very rapidly calculated; we write down the median 
and the two quartiles, calculating them roughly or by one of 
the more complete methods given above, and at once write 
down the probable error; by this means the skewness of the 
group can be calculated in five minutes. But this measurement 
depends on the positions of three points only, which are 
subject to accidental errors, and the parts outside the quartiles 
have not much influence on the result. 

A measure, which is influenced by all the items, is obtained 
by taking the third moment about the centre of gravity; this 
in itself is a measure of the skewness, but it is not of the 
kind required, for it is a concrete quantity of the order of 
the cube of the unit. The deviations have been cubed: to reduce 
it to an absolute quantity it must be divided by $c^3$. Calling 
this quantity j, we have $j = \mu_3 / c^3$, which in the group given 
on p. 21 is .016. 

In a curve which is nearly symmetrical and approximates 
to the curve of error, the distance between the arithmetic 
average and the median will be $\frac{1}{3}jc$, and the distance between 
the arithmetic average and the mode will be $jc$, and these 
relations supply a third method of estimating the skewness. 
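
These relations can be tried on the figures of page 21. The sketch below, in Python, assumes the third moment .78, the modulus 3.623 inches, the average 67.54 inches and the skewness +.06 "from observations" as quoted there; the variable names are illustrative.

```python
mu3 = 0.78        # third moment about the centre of gravity
c = 3.623         # modulus, inches
average = 67.54   # arithmetic average, inches

# Skewness from moments: the third moment reduced to an absolute quantity.
j_moments = mu3 / c**3

# Positions of median and mode implied by the skewness from observations.
j_observed = 0.06
median = average - j_observed * c / 3
mode = average - j_observed * c

print(round(j_moments, 3), round(median, 2), round(mode, 2))
```

The two values of j differ considerably in this group, which is nearly symmetrical, so that the implied median and mode are rough positions only.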


Diagram VIII. 

Daily Wages of Belgian Coal Miners. 


First estimate the modulus, and then calculate the position 
of either mode or median and the arithmetic average, divide the 
distance by c or by $\frac{1}{3}c$, and we obtain j. But that is not an 
accurate method, if we use the mode, which cannot be 
precisely determined; while if we use the median, we are 
depending upon a single position. The formulae to be 
preferred are $j = \frac{OQ_2 - Q_1O}{OQ_2 + Q_1O} \times 3.14$, and $j = \frac{\mu_3}{c^3}$, the former 
perhaps when the curve is not approximately the curve of 
error. 


The adjoining Diagrams illustrate the practical use of the 
technical quantities which I have now discussed. In 1896 
the Belgian Government undertook an Industrial Census, and, 
amongst other things, they collected figures of the wages of 
most of the workpeople of Belgium. We have here in 
graphic form the daily wages of the Belgian coal miners in 
1896. A supplementary enquiry was conducted in 1900 over 
nearly the same area, and the result is given just below. The 
methods we have developed give us a rapid means of 
comparing the results of those two enquiries. It is the 
rectangular figures only with which we have to deal at 
present. The average increased from 3.68 francs to 
5.36 francs between the dates; the modulus from 1.20 to 
2.047 francs; the skewness changed from a negative skewness 
of -.10 to a positive one of .22. Those three statements 
rightly understood and interpreted give in a brief form the 
result of the Census. The average has increased, more money 
went in wages, and the modulus and standard deviation have 
increased very much. There was a development, therefore, 
of wages away from the average, either by highly skilled 
workers increasing their wages greatly, or by a body of 
unskilled workers coming into existence. If you look at the 
curve you will see the dispersion is chiefly increased to the 
right, and that increased standard deviation is due either to 
the inclusion of a higher grade of workmen than had been 
included before, or to the fact that the higher grades of work 
had obtained a great increase of wages. I am inclined to 
think it possible that the increase of dispersion is partly due 
to the erroneous inclusion of people in the second enquiry 
who were not included in the first, but I have no means of 
going behind the figures. The change of j comes from the 
same sort of reason, that a body of skilled workmen were 
obtaining higher wages, or that the number of skilled 
workmen had increased. Either of these means would 
increase j in a positive direction. This use of the letters 
may be left for consideration. 


Returning for a moment to the use of deviations in 
connection with the median and arithmetic average, I have 
to point out the curious relation between the two. The 
arithmetic average is that quantity from which the sum of 
the deviations is nothing, and the sum of the squares of the 
deviations the least possible. The second result is obtained 
instantly from the formula already given, $\mu_2 = \mu_2' - \mu_1'^2$. The 
sum of the squares of the deviations from the arithmetic 
average is $n\mu_2$, the sum of the squares from some other origin 
$n\mu_2'$, and from that formula $\mu_2$ is always less than $\mu_2'$. The 
median on the other hand makes the sum of the first powers 
of the deviations a minimum, and the sum of the zero powers 
zero. If we take the zero power of the deviations, each 
deviation is replaced simply by 1, with its sign retained, and 
then from the definition of the median we find the sum of the 
zero powers measured from the median is zero. That the sum of 
the first powers is a minimum can be readily demonstrated, 
most easily by an analogy. Suppose that it is required to run 
from a telephone exchange separate wires to every one of n 
places in a straight line, where should the exchange be placed, 
so as to use the least total amount of wire? At the median 
position. For if you move from the median position to the 
right or to the left you will find immediately that you are 
adding more wire than you are subtracting. Supposing there 
are 20 stations, and you have a position between the 10th and 
11th; if you move to a position between the 11th and 12th, you 
have to increase your distance from 10 stations and diminish it 
from 9, in every case by the same length of the wire. The wires 
correspond to the deviations; and the sum of lengths of the 
wires is the sum of the lengths of the deviations. Consideration 
of this illustration will show that the sum of the deviations is 
a minimum when they are measured from the median, but that 
the median is not quite determinate, for if there are an even 
number of stations the sums of the deviations measured from 
all points between the two central stations are the same. 
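
The telephone illustration is easily tried with numbers. The sketch below is in Python; the eight station positions are hypothetical, chosen only so that the count is even and the two central stations stand at 7 and 11.

```python
from statistics import mean, median

# Hypothetical positions of eight stations along a straight line.
stations = [1, 2, 4, 7, 11, 16, 22, 29]

def total_wire(exchange):
    """Sum of the first powers of the deviations, all taken as positive."""
    return sum(abs(s - exchange) for s in stations)

def total_squares(origin):
    """Sum of the squares of the deviations from the given origin."""
    return sum((s - origin) ** 2 for s in stations)

# Any position between the two central stations uses the same length of wire.
for exchange in (6, 7, 9, 11, 12):
    print(exchange, total_wire(exchange))

# The sum of squares is least at the arithmetic average, not at the median.
print(total_squares(mean(stations)), total_squares(median(stations)))
```

Positions outside the two central stations use more wire, while every position between them gives the same total, which is the indeterminacy of the median noted above.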



MEASUREMENT OF GROUPS. 


THIRD LECTURE. 


The Curve of Error. 

The subject discussed in this section is full of technical 
difficulties, and it will be impossible to cover the subject 
adequately in the short space allotted to it. It must then be 
regarded as containing rather a summary of those important 
points connected with the theory of error, which I shall have 
to use subsequently. While making it as complete as possible 
in itself, in several cases I shall have to ask acceptance 
without proof of results which I shall find it necessary to use 
at a future date. 

Among the various shapes assumed by groups of observations 
of any kind which are (as in the groups already taken) 
grouped in a more or less regular way about the central line, 
there is one distribution of the various deviations about 
their centre which is regarded as normal, and the curve 
representing it is called the curve of error. And it is the 
deduction of the equation of that distribution which I have 
first to deal with. After we have the equation we will discuss 
to what extent the normal curve is actually found in the kind 
of statistics with which we deal. The normal curve can be 
obtained from the statistics found in games of chance, or 
from the statistics which may be obtained by counting the 
occurrence of specified digits in mathematical tables, or from 
anthropometric measurements, or again from some groups of 
social statistics and from some groups of vital statistics. The 
deduction of the equation I am going to take is the only one 
which I think lends itself to purely algebraic treatment. 
Other deductions depend upon the use of differential calculus 
or even of the theory of functions. 

Let us consider some occurrence for which the chance 
is p, the chance against q, so that p + q = 1. Let us suppose 
the event which may or may not give the occurrence takes 
place n times again and again, and that in each n times we 
count how often success is obtained. For instance, suppose 
we pitch a coin n times and count how many heads are found 
and then repeat the n-fold experiment again and again and 
count in each case the number of heads, that would give a 
series of the kind I have in mind. For a small number of 
experiments, if each set of experiments contained 16 tries or 
any small finite number, it is easy to set down the probabilities 
of the various numbers of successes. And it is also clear as 
soon as the algebra of the method is tackled, that there is a 
limit towards which these chances tend as the number of 
experiments in each group is indefinitely increased. What we 
have to do first is to find the limit towards which such a series 
of experiments tends when the n is increased indefinitely. 



The diagram annexed represents the various chances of 
the numbers of heads in the experiments of pitching a coin 
12 times. The most probable number of heads is of course six, 
the least probable none, or 12, and the probability of 0, 1, 2, 
up to six, is continually increasing. If we erect 13 ordinates 
representing the probability of no heads, one head, and so on 
up to 12 heads, we get the diagram marked $(\frac{1}{2} + \frac{1}{2})^{12}$. If we 
take another kind of experiment where the chances for success 
and failure are not equal, e.g., where the chance of success 
is .3, and perform the experiment 10 times, we get the 
probabilities of one, two, and so on up to 10 successes 
represented by the diagram marked $(.3 + .7)^{10}$. 

The first curve is of course symmetrical, the second curve 
unsymmetrical. What we have to do is to deduce the 
shape of the curve when the index is infinite, whether the 
chance in favour is one-half, or whether the chances for and 
against are unequal. 
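
The ordinates of the two diagrams can be written down from the binomial expansion. The sketch below, in Python, is an illustration only; `binomial_terms` is a hypothetical helper name, and the two cases are those described above.

```python
from math import comb

def binomial_terms(n, p):
    """Chances of 0, 1, ..., n successes in n trials, each of chance p."""
    q = 1 - p
    return [comb(n, m) * p**m * q**(n - m) for m in range(n + 1)]

heads = binomial_terms(12, 0.5)     # pitching a coin 12 times
unequal = binomial_terms(10, 0.3)   # chance of success .3, ten trials

for m, chance in enumerate(heads):
    print(m, round(chance, 4))

# Position of the greatest ordinate in each case.
print(heads.index(max(heads)), unequal.index(max(unequal)))
```

The first list reads the same forwards and backwards; the second does not, its greatest ordinate standing at three successes.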

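If p is the probability of an event, and p + q = 1, then the 
probability of m successes in n trials is $\frac{n!}{m!\,(n-m)!}\,p^m q^{n-m}$, and 
successive values of m give the terms of the binomial 
expansion $(p + q)^n$. 

Assume that np is integral. Let np = r, nq = s, r + s = n. 
Denote successive terms by $u_0, u_1, \ldots$ 

Then $u_s$, which is the greatest term, $= \frac{n!}{r!\,s!}\,p^r q^s$, and 

$$u_{s+x} = u_s \cdot \frac{\left(1 - \frac{1}{r}\right)\left(1 - \frac{2}{r}\right) \cdots \text{ to } x-1 \text{ factors}}{\left(1 + \frac{1}{s}\right)\left(1 + \frac{2}{s}\right) \cdots \text{ to } x \text{ factors}}.$$

$$\log u_{s+x} = \log u_s + \log\left(1 - \frac{1}{r}\right) + \log\left(1 - \frac{2}{r}\right) + \ldots + \log\left(1 - \frac{x-1}{r}\right) - \log\left(1 + \frac{1}{s}\right) - \log\left(1 + \frac{2}{s}\right) - \ldots - \log\left(1 + \frac{x}{s}\right)$$

$$= \log u_s - \frac{1 + 2 + \ldots + (x-1)}{r} - \frac{1^2 + 2^2 + \ldots + (x-1)^2}{2r^2} - \frac{1 + 2 + \ldots + x}{s} + \frac{1^2 + 2^2 + \ldots + x^2}{2s^2} - \text{\&c.}$$

$$= \log u_s - \frac{x(x-1)}{2r} - \frac{x(x+1)}{2s} - \frac{(x-1)x(2x-1)}{12r^2} + \frac{x(x+1)(2x+1)}{12s^2} + \text{\&c.}$$

$$= \log u_s - \frac{x^2(r+s)}{2rs} + \frac{x(s-r)}{2rs} + \frac{x^3(r^2 - s^2)}{6r^2s^2} + \text{\&c.}$$

$$= \log u_s - \frac{x^2}{2pqn} + \frac{x(q-p)}{2pqn} + \frac{x^3(p-q)}{6p^2q^2n^2} + \text{\&c.}$$

Let $\frac{x^2}{2pqn} = z^2$. Then 

$$\log u_{s+x} = \log u_s - z^2 + \frac{z(q-p)}{\sqrt{2pqn}} - \frac{4z^3(q-p)}{6\sqrt{2pqn}} + \text{\&c.}$$

The assumption that np is integral made above does not 
affect the limiting form of the equation. 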

It is at this point necessary to consider which terms are to 
be rejected, when n is made infinite. If x is finite, if we move 
through only a finite number of terms from the greatest 
ordinate, the ordinate $u_{s+x}$ equals the ordinate $u_s$. This part 
of the curve approximates to a horizontal straight line. To 
take a numerical instance, the chance of obtaining 499 heads 
in 1,000 tosses is practically equal to that of obtaining 500 
heads. On the other hand if x is infinite, it appears that $u_{s+x}$ 
is zero. If the figure is drawn so as to show finite values 
of x we obtain a horizontal straight line; but if an attempt is 
made to include infinite values of x, the curve becomes the axis 
of x and a finite vertical line through the origin. 

But it becomes clear, if we examine the shape for different 
finite values of n, that the curve has a definite shape and finite 
curvature near the centre. Before we go further let us take 
an analogy. If we take an hyperbola and try to include the 
whole curve in our figure the curve will coincide with its 
asymptotes. In order to draw the curve so that the part 
between the asymptotes and the vertex can be seen, we must 
adopt a particular scale so as to obtain the length from the 
vertex to the centre as a finite quantity. Again, if we pass 
from the ellipse to the parabola by the process of pushing the 
centre to infinity you have, in order to obtain the finite part of 
the parabola at all, to make the hypothesis that $y^2/x$ is finite. 
In order to get the finite part of the curve of error we shall 
have to select that part where the ratio of $x^2$ to n is finite. 
Then it will be found that we shall obtain the part of the 
curve that has a definite curvature and a definite shape in a 
finite form. Let us assume, then, that $\frac{x^2}{n}$ is finite; and let us 
substitute for $\frac{x^2}{n}$ the quantity $z^2$ with the factor 2pq. The reason 
for that factor will soon be obvious. Take $c^2 = 2pqn$, so that 
$x = zc$. We then obtain the equation $\log u_{s+x} = \log u_s - z^2$ 
when all vanishing terms are neglected. If the above 
deduction is carefully examined it will be found that all the 
terms omitted are infinitesimal in comparison with those 
retained, when n is infinite. 

Removing logarithms, and writing y for %+a., we have 

^ =Us,e ^ . 
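
As a hedged numerical check of this limiting form, the sketch below compares the exact binomial term with the greatest term multiplied by $e^{-x^2/c^2}$, taking $c^2 = 2pqn$. The values n = 1,000 and p = q = 1/2 are assumed purely for illustration, and the helper names are hypothetical.

```python
from math import exp, lgamma, log


def binomial_term(n: int, k: int, p: float) -> float:
    """Exact chance of k successes in n trials, computed through logarithms."""
    log_coefficient = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coefficient + k * log(p) + (n - k) * log(1 - p))


def limiting_term(n: int, x: int, p: float) -> float:
    """Greatest term multiplied by exp(-x^2 / c^2), with c^2 = 2pqn."""
    q = 1 - p
    greatest = round(n * p)  # position of the greatest term; np assumed integral
    c_squared = 2 * p * q * n
    return binomial_term(n, greatest, p) * exp(-x * x / c_squared)


n, p = 1000, 0.5  # illustrative values only
for x in (0, 5, 10, 20, 30):
    exact = binomial_term(n, round(n * p) + x, p)
    print(x, exact, limiting_term(n, x, p))
```

For these assumed values the two columns agree closely, and the agreement improves as n is increased with $x^2/n$ held fixed.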
We are still at liberty to choose a scale for the ordinates,
and it is most convenient to choose that which makes the greatest
ordinate $= \frac{1}{c\sqrt{\pi}}$, for then the area bounded by the curve and
the axis of x becomes unity; then each part of the area represents
the probability of certain occurrences, for the whole curve
represents 1, which stands for certainty. An alternative is to
take the ordinate as $\frac{N}{c\sqrt{\pi}}$, so that the area of the curve is N,
where N is the number of experiments. Then the area standing
on any part of the axis represents the most probable number
of events corresponding to that part.

Now let us go back and take the terms we have so far
rejected, which involve $\frac{1}{\sqrt{n}}$. Each of these contains the
factor $\frac{p-q}{\sqrt{2pqn}}$. It is convenient to call that quantity 2j,
for then we shall find that j has the meaning already assigned
to it (see p. 21, and for proof see p. 36).

Re-writing the equation with that notation, and then
expanding the part which contains j and neglecting the higher
powers of j, we have

$$y = \frac{1}{c\sqrt{\pi}}\, e^{-x^2/c^2} \left\{ 1 - 2j\left( \frac{x}{c} - \frac{2x^3}{3c^3} \right) \right\}.$$

It is easily seen that $jc = \frac{1}{2}(p - q) = p - \frac{1}{2} = \frac{1}{2} - q$. The centre
of gravity of that curve can be shown to be at the origin by 
integration. The area of the curve is of course the integral 
of ydx, taken between plus infinity and minus infinity. The 
part of the integral which does not contain j is a well-known 
definite integral, which equals unity. It can be seen that the 
part containing only odd powers of x does not affect the 
definite integral. Hence the area is unity. 

Now let us calculate the error of mean square of the curve
from the equation. It is obtained by multiplying the element
of area y dx by the square of its distance (x) from the centre of
gravity, and adding up all the parts so obtained, and then
dividing by the whole area, i.e., unity.

It is easily seen that the j term does not enter into the
result, which is therefore $\int_{-\infty}^{\infty} y\,x^2\,dx = \frac{1}{2}c^2$, by integration by
parts. Comparing this with p. 21, we see that c, thus
calculated, is the modulus, as there defined. The third
moment is $\int_{-\infty}^{\infty} x^3\,y\,dx$ divided by the area, which is unity;
integrating by parts we obtain that the skewness, as defined
above, is equal to the j in this equation. The constants
in the equation of the curve of error, as written above, are
then the modulus and skewness as defined for curves in
general. The average deviation, as defined on p. 24, is found
by integrating $\int y\,x\,dx$, x being reckoned as positive throughout,
to be $\frac{c}{\sqrt{\pi}}$, and does not involve j.
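
These integrals can be verified numerically. The sketch below assumes the curve has the first-order form $y = \frac{1}{c\sqrt{\pi}} e^{-x^2/c^2}\{1 - 2j(x/c - 2x^3/3c^3)\}$ and uses Simpson's rule; the values c = 3.623 and j = .06 are borrowed from the height example only as an illustration, and all function names are hypothetical.

```python
from math import exp, pi, sqrt


def skew_curve(x: float, c: float, j: float) -> float:
    """Ordinate of the assumed first-order skew curve of error."""
    u = x / c
    return exp(-u * u) / (c * sqrt(pi)) * (1 - 2 * j * (u - 2 * u**3 / 3))


def simpson(f, a: float, b: float, steps: int = 4000) -> float:
    """Composite Simpson's rule; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def curve_moments(c: float, j: float, reach: float = 8.0) -> dict:
    """Area and moments about the origin, integrating over +/- reach moduli."""
    a, b = -reach * c, reach * c
    y = lambda x: skew_curve(x, c, j)
    return {
        "area": simpson(y, a, b),
        "first": simpson(lambda x: x * y(x), a, b),
        "second": simpson(lambda x: x * x * y(x), a, b),
        "third": simpson(lambda x: x**3 * y(x), a, b),
        "average_deviation": simpson(lambda x: abs(x) * y(x), a, b),
    }


c, j = 3.623, 0.06  # illustrative values
m = curve_moments(c, j)
print(m["area"], m["first"])                 # compare with 1 and 0
print(m["second"], c * c / 2)                # mean square against c^2 / 2
print(m["third"], j * c**3)                  # third moment against j c^3
print(m["average_deviation"], c / sqrt(pi))  # average deviation against c / sqrt(pi)
```

Under this assumed form the third moment comes out as $jc^3$, which is consistent with j being the skewness when the third moment is measured in units of the cube of the modulus.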

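The equation in its integral form is, if $F(\tau)$ stands for the
area on the abscissa from the origin to x, and $\tau$ for $x/c$,

$$F(\tau) = \frac{1}{\sqrt{\pi}} \int_0^{\tau} e^{-t^2}\,dt \mp \frac{j}{3\sqrt{\pi}} \left\{ 1 - (1 - 2\tau^2)\, e^{-\tau^2} \right\},$$

the lower sign being taken when x is negative.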

This does not admit of any simple evaluation, but it has
been tabulated for a wide range of values of x.* From these
tables it is found that the probable error for the
symmetrical curve (where j is zero) is c × .4769, which is written
pc. For the unsymmetrical curve the distances between the
median and the quartiles can be shown to be $pc \pm \frac{2}{3}p^2 jc$,
while the distance between the centre of gravity and mode is
jc,† and between the centre of gravity and the median is
$\frac{1}{3}jc$, as used on pp. 27-29 above, where the resulting numerical
values are given.

The effect on the curve of the j term is to stretch the
curve to the right, heaping it on the left at the same time, the
sort of figure which is indicated in the second diagram on p. 32.
Actual examples of the curve for different values of c and j
are given on pp. 21, 28.

* See Burgess's Mathematical Tables; Merriman's Least Squares, p. 186;
Elements of Statistics, p. 281, and p. 332 (2nd Edition); and Journal
of the Royal Statistical Society.

† See Elements of Statistics, p. 331. Hence, $OQ_3 \sim OQ_1 =$ ..., which
gives results on p. 27 and p. 29.

The tables give the integral for the argument $\tau = x/c$, not for
x, and before they can be used the observations must be
reduced to the centre of gravity as origin and c as unit.
Then if we find in the table that the integral function is
.455 when the argument = +1.387,* we are to understand
that .455 of the whole area stands on the axis of x between 0
and 1.387 of the modulus. The tabular statement then shows
the various fractions of the whole observations which may be
expected (in an infinite number of experiments) to lie between
the most probable value and various values with an assigned
deviation from the centre. Thus with the symmetrical curve
of error, one-quarter of the observations may be expected to
be above the most probable value by not more than .47 of the
modulus, one-third by not more than .68 of the modulus; all
but 2 per 1000 are separated by less than 2.2 of the modulus
from the most probable value; the chance of a deviation of
5 times the modulus is less than 1 in a billion.
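
The tabular values quoted above can be reproduced with the error function of the Python standard library. This is a sketch only: it assumes that the symmetrical part of the tabulated integral is $\frac{1}{2}\,\mathrm{erf}(\tau)$ and that the j term has the first-order form $\frac{j}{3\sqrt{\pi}}\{1 - (1 - 2\tau^2)e^{-\tau^2}\}$; the function name is hypothetical.

```python
from math import erf, exp, pi, sqrt


def area_from_centre(tau: float, j: float = 0.0) -> float:
    """Fraction of the whole area standing between the centre and tau moduli.

    tau is x / c and may be negative; the result is returned as a positive area.
    The j term assumes the first-order skew curve described in the lead-in.
    """
    t = abs(tau)
    symmetric = 0.5 * erf(t)
    skew = j / (3 * sqrt(pi)) * (1 - (1 - 2 * t * t) * exp(-t * t))
    return symmetric - skew if tau >= 0 else symmetric + skew


print(area_from_centre(0.4769))          # about one-quarter
print(area_from_centre(0.68))            # about one-third
print(1 - 2 * area_from_centre(2.2))     # just under 2 per 1,000 on both sides
print(area_from_centre(1.387, j=0.073))  # close to the tabulated .455
```

The last line uses the value j = +.073 given in the footnote; small differences in the third decimal place are to be expected from rounding in printed tables.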

* Which is the case when j = +.073.

Supposing we are given a set of observations which we
have reason to suppose should arise from the distribution
defined by the symmetrical curve of error, what particular
curve of error are we to fit to our observations? The problem
is not very important in itself, but the method of solution is
very similar to the method which underlies the principle of
least squares and of several other formulæ. The only things
which we have a possibility of choosing are the abscissa of the
centre of gravity and the modulus.

Let $x_1, x_2, \ldots x_n$ be the deviations of the observations
measured from their average. The separate chances that these
should arise if the equation of distribution is
$y = \frac{1}{c\sqrt{\pi}}\, e^{-(x-h)^2/c^2}$, are $\frac{1}{c\sqrt{\pi}}\, e^{-(x_r-h)^2/c^2}$,
where r is given successive values 1, 2 . . . n.

The chance that these should occur together in a given group is,
by multiplication,

$$c^{-n}\, \pi^{-n/2}\, e^{-\sum (x_r - h)^2/c^2} = P \text{ (say)}.$$

Now on what principle are we to find out the values of
c and h? Of all the curves of error from which these
observations may be supposed to have arisen there is one curve
from which they would arise with the least improbability; to
find this we have to make P a maximum. h and c are quite
independent. Then the differentials of P with regard to h
and c must each be zero. The first gives that h is zero* and
the centre of gravity of the observations is the origin. The
second shows that $\frac{c^2}{2}$ is the mean square of the x's.† So
that to choose the normal curve which fits the observations
best, in the sense that they would have arisen from that
distribution with the least improbability, we must take for the
centre of the curve the centre of gravity of the observations,
and for the modulus the error of the mean square multiplied
by $\sqrt{2}$.
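
The rule just stated (centre at the centre of gravity, modulus equal to $\sqrt{2}$ times the root of the mean square) can be sketched as follows. The data are hypothetical, and the function names are assumptions made for the example; the second function evaluates log P so that the maximum can be checked by trial.

```python
from math import log, pi, sqrt


def fit_symmetrical_curve(observations):
    """Return (centre, modulus) making the observations least improbable."""
    n = len(observations)
    centre = sum(observations) / n
    mean_square = sum((x - centre) ** 2 for x in observations) / n
    return centre, sqrt(2 * mean_square)


def log_chance(observations, centre, modulus):
    """Logarithm of P, the joint chance of the observations."""
    n = len(observations)
    spread = sum((x - centre) ** 2 for x in observations) / modulus**2
    return -n * log(modulus) - 0.5 * n * log(pi) - spread


heights = [64.0, 66.5, 67.0, 68.5, 71.0]  # hypothetical observations
centre, modulus = fit_symmetrical_curve(heights)
best = log_chance(heights, centre, modulus)
# Any other choice of centre or modulus should give a smaller value of log P.
print(best > log_chance(heights, centre + 0.1, modulus))
print(best > log_chance(heights, centre, modulus * 1.1))
print(best > log_chance(heights, centre, modulus * 0.9))
```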


It will be noticed in the proof that in a sense there is 
only one symmetrical curve of error. We can reduce any 
curve to the form $y = e^{-x^2}$, by suitable choice of scales for the
co-ordinates; but if we are taking two groups measured in
the same unit, for instance, both in inches, or shillings, or 
years, then the x axis has concrete units, the unit distance 
stands at one inch, one shilling, one year. And if we take 
two separate curves both measured in inches, work with the 
same unit of abscissa, and make the areas each unity, we do 
not get the same maximum ordinate. The finite part of the 
curve with the lower maximum ordinate stretches further to 
the right and left than the corresponding part of the other. 
As long as we deal with concrete quantities we shall find 
that the quantity c enters into the shape of the curve ; and 
the comparison of any two curves is made by means of the 
values of c given in terms of the unit of abscissa. The quantity 
j is independent of all concrete quantities, and is an absolute 
measure of skewness, as already pointed out.


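* $\frac{\partial P}{\partial h} = \frac{2}{c^2} \sum (x_r - h) \cdot P = -\frac{2nh}{c^2} \cdot P = 0$ when h is 0.

† $\frac{\partial P}{\partial c} = \left( -\frac{n}{c} + \frac{2\sum x_r^2}{c^3} \right) \cdot P = 0$, when $\frac{c^2}{2} = \frac{\sum x_r^2}{n}$, since h is 0.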


We will now use the height-statistics given on p. 21 as
an example of the method of comparing a set of observations
with the curve of error. In the first place we take the centre
of gravity as the origin, namely, 67.54 inches. The modulus,
by the method of moments, is 3.623 inches, which is therefore
to be taken as the unit. Thus 59 inches is 8.542 inches below
the average, that is, 2.357 times the modulus. The latter
number is entered under $\tau$ in the second column. All the
others are calculated in the same way. Then turning to the
tables, and finding what integral corresponds to the assigned
values of $\tau$, in the symmetrical curve of error, we write them
under the heading $F(\tau)$. So we have that between the
average and 59 inches .5 of the whole curve is obtained, that
is to say, one-half; in the next line, between average and
60 inches .498 of the curve is obtained, and so on all the way
down to between the average and 76 inches, when again half
the curve is obtained, correct to the third decimal place. We
should not get the true half till we have gone to infinity,
but the area of the curve beyond does not amount to one per
mille of the whole. In this curve, for example, .462 is the
probability that the height of a person chosen at random lies
between 67.54 inches and 63 inches, for .462 is opposite
63 inches. The fraction of the curve is the same as the
probability of the occurrence between the point given and
the average.
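
The reduction to the modulus as unit, and the look-up in the table of the symmetrical curve, can be sketched as below. The centre (67.54 inches) and modulus (3.623 inches) are the figures of the text; the use of the standard library error function in place of printed tables, and the function names, are assumptions of the example.

```python
from math import erf

CENTRE = 67.54   # centre of gravity in inches, from the text
MODULUS = 3.623  # modulus in inches, from the text


def reduced_deviation(height: float) -> float:
    """Deviation from the centre of gravity, with the modulus as unit."""
    return abs(height - CENTRE) / MODULUS


def fraction_between_average_and(height: float) -> float:
    """Area of the symmetrical curve between the average and the given height."""
    return 0.5 * erf(reduced_deviation(height))


for height in (59, 60, 63):
    tau = reduced_deviation(height)
    print(height, round(tau, 3), round(fraction_between_average_and(height), 3))
```

To three decimal places this gives 2.357 and .5 for 59 inches, .498 for 60 inches and .462 for 63 inches, in agreement with the entries quoted.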

The next column, called $Y(\tau)$, is obtained in a similar
way from tables including the term involving j; the value
of j is taken to be +.06 for reasons given below. The column
following under “calculated” consists of the differences of
the $Y(\tau)$ column multiplied by 1,000; the numbers so obtained
are the numbers to be expected approximately between 59
and 60 inches, 60 and 61 inches, &c. The following column
“actual” gives the actual occurrences per 1,000 in the same
limits. The following column gives the differences in the 
various groups between the calculated and actual numbers. 
The greatest divergence is near the centre, where there are 
12 more than were calculated. In the last column are given 
the differences if I had taken the normal curve instead of the 
skew curve. It is seen that by taking the curve as a skew curve 
the sum of these differences is diminished from 80 per 1,000 
to 64 per 1,000. 

I have now a rather difficult point to take with reference
to one of those columns. Theoretically, j is calculated by the
method of moments, the error of mean cube; but in practice
that does not give good results. A single observation a long
way from the average has a very great effect on the mean
cube. So that if in this number of 1,935 persons we had
included two persons from a nationality where stature was
very low, or where it was very high, we should have instances
at a long way along the group which would not properly
vitiate the comparison of the curve of error, but would have a
very unfortunate effect upon the mean cube. Instead of
having a homogeneous group, we should have a group of
1,933 people from one group and 2 persons from another
group which would not belong to the same curve. There has
been a great deal of discussion as to what should be done
with such abnormal cases. A good way out of the difficulty
is not to calculate j by the above method at all, but to
calculate it by an a posteriori method, to choose that value
of j which makes the misfit least. We have already chosen c
so as to make the improbability least. Let us choose j by
some similar test. The method I have adopted here is due
partly to Professor Karl Pearson, and partly to Professor
Edgeworth. It is to obtain figures (not given here) in such a
form that it can be seen what value of j will make the sum of
the absolute differences least. The value which satisfies this
condition is found to be j = .06.* The value obtained from the
moments method is .016. This might have been used and
would have given a result slightly better than the value j = 0.
But I am inclined to say it is better to calculate j from the
a posteriori method; I think it is quite as logical, and you
are bound to get a better fit.

Professor Karl Pearson has given a test by which you
can consider the following problem: Supposing you had a
population with certain characteristics, such as height,
distributed according to a curve with a particular formula,
required the probability that an assigned distribution would
be obtained from the supposed distribution. Putting it in
a more concrete way, suppose the equation of the height
group for the whole population was this equation with
c = 3.623 inches, and j = .06: required the probability that 1,935
persons taken at random from the population would have the
heights actually registered. Professor Karl Pearson has
given a table† with the necessary figures for determining that
probability. Calculation from his table on this distribution
shows that if we take the symmetrical curve the probability
of obtaining such a selection is .4; that is to say, the chances
are two in five that the 1,935 persons would not be further
from the supposed distribution than they actually are. If
we take the skew curve with j = .06, the probability is .7;
that is to say, the odds are seven to three that we should
obtain 1,935 persons as nearly conforming to this group as
we have found. It is very difficult to argue back from the
height of a person to the expression $(p+q)^n$, and I shall
not at present attempt it. I have shown above that we
should obtain this formula of the curve of error if we were
dealing with chances, with events whose occurrence was
given by those terms in the binomial theorem. But the same
equation will be obtained on very many other suppositions,
and I have only taken the simplest. Before giving these,
however, it is necessary to define a frequency curve.

* See Journal of the Royal Statistical Society, June 1902, pp. 337-8.

† See London, Edin. and Dublin Phil. Mag., July 1900, p. 175.

If we are dealing with a group of measurements which are
distributed about their average so that the number of them
which lie at any defined distance from their average, say
between x and (x + dx) in excess of it, can be represented by
a definite function, say f(x), of that distance, then the curve
which represents this function, i.e., y = f(x), is the frequency
curve of that group. If the unit of ordinate is so chosen that
the whole area contained between the curve, the ordinates at
its extremities, and the axis of x, is unity, then $\int_a^b y\,dx = 1$ if
a and b are the limiting values of x; in many cases a and b
are $\pm\infty$. Then if the quantity is selected at random from
the group, the probability that it will lie between $x_1$ and
$x_2$ is $\int_{x_1}^{x_2} y\,dx$; the probability that it will lie between x and
x + dx is y dx.

If we take the experiment I instanced at the beginning, 
the tossing of a coin, and make the number of times tossed 
very great, the chance of obtaining given deviations would 
be given by the curve of error, as already shown. This is the 
frequency curve for the group of experiments. Events are 
ruled by very different laws of distribution. We may have 
a very skew curve, as, for instance, in the curves of ages of 
wives in Yorkshire where the mode was a long way to the left 
of the average ; the smooth curve which best fits those 
observations would be the curve of frequency for the ages 
of such persons. That is to say, if we draw this curve, 
representing as nearly as possible the observed facts, and we 
make this area equal 1, the area standing on the part of the 
axis between the 35 and 40-year marks would represent
the chance of a person taken at random being between
35 and 40 years old. If we were given the age of a man
who had a wife in Yorkshire and we did not know her
age, that area would represent the chance that her age
would be between 35 and 40. The life curve, to take
another example, is a frequency-curve. To any frequency- 
curve we can assign a modulus calculated from the second 
moment. That tells one distinct fact as to the distribution 
about the average. The curve may have the greater part of 
its area to the left or to the right of the average, and 
it may have an asymptote as in the case of the curve 
of error; but there is in general only a small fraction 
of the area beyond two or three times the modulus, which 
may therefore be taken as indicating the practical extent of
the curve. It is often useful to speak of the precision (h),
instead of the modulus (c), where $h = \frac{1}{c}$. The greater h is, the
more precise are the predictions that can be made as to a
magnitude taken at random.

If we are dealing with frequency-curves whose practical 
range is small and whose modulus is finite, and if we take a 
great number of these frequency-curves, or rather if we have 
to select from a great number of things whose sizes are ruled 
by different frequency-curves, for example, if we make up a 
line of a great number of pieces of metal taken from different 
heaps with different frequency-curves for each heap, it is 
possible to find the frequency-curve for the sum of these 
elements, that is for the length of the line you have made. I 
will put that in different form with a different illustration. 
Suppose we are going to take 100 books, and we can select
them from 100 different groups of books whose thicknesses
are bounded within definite ranges and have a different
modulus which can be assigned, required the breadth of
100 books put together. The most probable breadth will be
that obtained by adding the averages of the 100 different
groups. From the terms of the question it is obviously very
improbable we shall get all the 100 below the averages of
their respective groups or all above. The actual breadth will
have a frequency-curve of its own about an average which is
the sum of the averages of the groups from which you select.
Its modulus can be shown to be the square root of the sum of
the squares of the moduli of the original frequency-curves.
Thus, to take a special case, if we are going to select two
only, which obey normal curves with the same modulus,
the modulus for the sum is $\sqrt{2}$ times the modulus of either.
The developments from this theory are of great practical
importance.
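
The rule for the modulus of a sum is short enough to state as code. The following is a minimal sketch with a hypothetical function name; the book thicknesses are invented for the example.

```python
from math import sqrt


def modulus_of_sum(moduli):
    """Modulus of a sum of independent quantities: root of the sum of squares."""
    return sqrt(sum(c * c for c in moduli))


# Special case of the text: two quantities with the same modulus c.
c = 3.0
print(modulus_of_sum([c, c]), sqrt(2) * c)

# Hypothetical case: 100 books, each group having a modulus of 0.2 inch.
print(modulus_of_sum([0.2] * 100))  # ten times the single modulus, not a hundred
```

The second case shows why the breadth of the whole row is relatively much more certain than the thickness of any one book: the average grows as the number of books, the modulus only as its square root.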

If we take one sample at random from each of a number
of these frequency-curves whose moduli are not very unequal,
so that no one curve predominates, and add together the
quantities so obtained, then the quantity obtained obeys the
curve of error itself, whether the original frequency-curves
were curves of error or not. I cannot give the proof here;
the theorem as I state it is partly due to Laplace and partly
due to Professor Edgeworth.* That is one of the most general
statements of the cases in which the curve of error will arise;
and that conception may properly be applied to the conception
of height and the causes which determine a person's height.
No single cause has very great influence compared with others,
so far as we know, and they all presumably have measurable 
effects whose frequency-curves are definite. Thus, we might 
expect a priori the frequency-curve of heights to be the curve 
of error. 
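
Although the proof is not given, the theorem can be illustrated by experiment. The sketch below is a simulation under stated assumptions: fifty rectangular (uniform) frequency-curves of hypothetical ranges, none predominating, one sample drawn from each and the samples added. It compares the modulus of the sums with the root of the sum of squares of the separate moduli, and counts the fraction of sums within .4769 of the modulus of their average, which should be near one-half if the curve of error is obeyed. All names, ranges and the seed are choices made for the example.

```python
import random
from math import sqrt


def simulate_sums(widths, trials, seed=1903):
    """Draw one value from each uniform curve and add them, many times over."""
    rng = random.Random(seed)
    return [sum(rng.uniform(-w / 2, w / 2) for w in widths) for _ in range(trials)]


def modulus_of(values):
    """Modulus estimated as sqrt(2) times the root of the mean square deviation."""
    n = len(values)
    centre = sum(values) / n
    return sqrt(2 * sum((v - centre) ** 2 for v in values) / n)


widths = [1.0 + 0.02 * i for i in range(50)]  # hypothetical ranges
sums = simulate_sums(widths, trials=20000)
observed = modulus_of(sums)
# A uniform curve of range w has mean square w^2 / 12, hence modulus w / sqrt(6).
predicted = sqrt(sum(w * w / 6 for w in widths))
centre = sum(sums) / len(sums)
within = sum(abs(s - centre) <= 0.4769 * observed for s in sums) / len(sums)
print(observed, predicted, within)
```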

Another illustration is supplied by the grouping of school
children in a particular grade.† I took one of the most
populous grades in the Report of the St. Louis Public Schools,
U.S.A., grouped the children according to their ages, and fitted
the curve of error by one of the methods I have described.
The curve of error with c = 1.68, j = .073, fits the observations
closely. If we think of the causes which determine the
position of a child in a particular grade or class, I think we
shall find that they are akin to those I have supposed in
my statement as to causes which lead to the asymmetrical
curve of error. But it would be absurd to go back and try
to re-value p, q and n, the quantities on which the algebraic
proof of the equation depended. We could find out, of course,
what chances would produce this particular distribution; but
they would have no necessary relation to the facts. The idea
I wish to give is that we can obtain the equation of the curve
of error in the form I am using it on a very simple supposition;
and it can be obtained from many other suppositions which
cannot be given in lecture work.

* See Edgeworth, in London, Edin. and Dublin Phil. Mag., 1892, p. 429.

† For the numbers, see Elements of Statistics, 2nd Edition, Appendix.



MEASUREMENT OF GROUPS. 


FOURTH LECTURE. 


The Method of Least Squares.


Suppose that we make a great many measurements of the
same quantity by several different methods; and that, as is
generally the case, the measurements differ from each other,
owing to imperfections of instruments, or by the numerous
accidental circumstances that attend any involved observations.
Let us assume that the measurements which could be made by
the first method are grouped according to the frequency-curve
$y = \frac{1}{c_1\sqrt{\pi}}\, e^{-x^2/c_1^2}$, those by the second method according to
$y = \frac{1}{c_2\sqrt{\pi}}\, e^{-x^2/c_2^2}$, and so on, a definite normal curve for each
method. Suppose we make n measurements, one of each kind.
It is required to find what is the most probable value of the
distance to be measured. All the x's we are dealing with
are errors in our measurement. From the series of partly
erroneous measurements it is required to find the most probable
value. That is the problem it is attempted to solve by the
method of least squares. As a second question, it is required
to determine the precision of the result, that is, to state
the probability that it is correct within assigned limits.

Before going further I must call attention to one very 
important point in the reasoning. In the reasoning on which 
the method of least squares is based it is assumed that the 
frequency-curves are normal curves of error, as written above. 
If the frequency-curve is not a normal curve of error the
method breaks down at the first step. That I shall have to
return to later. With regard to the moduli, we may either
suppose that we know them by some a priori method, as is
sometimes the case; or that we know them by having made
similar experiments at some other time, e.g., if we are dealing
with a group of height measurements where the modulus is
three inches generally; or we may find them from the
experiments themselves. A useful way is to repeat the
measurement by each method, say, 100 times, and from the internal
evidence find out what the moduli are. We assume that the
moduli are fixed quantities, quantities which we cannot affect,
and that they are known or previously determined quantities.

What is the probability that a certain series of errors should
result in n observations? Let $x_1, x_2, \ldots x_n$ be the differences
from the unknown true value which arise from n different
methods taken in one series; what is the probability that
those particular n deviations will occur at once? The
probability is obtained by multiplying together the probabilities of
their separate occurrences. The probability of the error $x_1$
occurring, when the modulus is $c_1$, is from its curve of
frequency $\frac{1}{c_1\sqrt{\pi}}\, e^{-x_1^2/c_1^2}$. The probability that the n will all
occur is obtained by multiplying n such quantities together,
that is,

$$\frac{1}{\pi^{n/2}\, c_1 c_2 \ldots c_n}\, e^{-\sum x^2/c^2}.$$

Here the only variables are the x's. Now, that probability
will be greatest when the index of e is greatest, that is when
$\sum \frac{x^2}{c^2}$ is least. Thus, from
all the possible values of the unknown true measurement, the
system of errors which we have found would arise with the
least improbability when $\sum \frac{x^2}{c^2}$ is made the least possible.
That is the statement which is at the basis of the method of
least squares. In the particular case, when we take all
the observations by the same method with the same curve of
frequency, so that c is the same for all the observations, the
minimal condition is satisfied when the sum of the $x^2$'s is a
minimum; and we have already seen that that sum is made
least when the unknown value is taken to be the arithmetic
average of the obtained values. Let me re-state this theorem
in other words. Suppose we start to measure a particular
object by the same method again and again. Then, the
measurements we obtain would come with the least
improbability when the sum of the squares of the deviations is a
minimum; and that condition is satisfied if we take the
arithmetic average of our measurements to be the unknown
true quantity. This statement is a particular case of the
method of least squares.
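
When the moduli differ, the condition that $\sum x^2/c^2$ be least can be worked out in the same way. Writing each error as the measurement less the unknown value and setting the differential to zero gives an average in which each measurement is weighted by $1/c^2$; this step is not spelled out above and is added here as an ordinary consequence of the calculus. The sketch below uses hypothetical measurements and function names.

```python
def most_probable_value(measurements, moduli):
    """Value minimising the sum of (measurement - value)^2 / modulus^2."""
    weights = [1 / (c * c) for c in moduli]
    return sum(w * m for w, m in zip(weights, measurements)) / sum(weights)


def weighted_sum_of_squares(value, measurements, moduli):
    """The quantity that the method of least squares makes a minimum."""
    return sum(((m - value) / c) ** 2 for m, c in zip(measurements, moduli))


measurements = [10.1, 9.9, 10.3]  # hypothetical measurements of one length
equal = most_probable_value(measurements, [2.0, 2.0, 2.0])
print(equal)  # with equal moduli this is the arithmetic average

moduli = [0.1, 0.2, 0.4]  # hypothetical: the first method is the most precise
best = most_probable_value(measurements, moduli)
print(best)
# Moving away from the weighted average in either direction increases the sum.
print(weighted_sum_of_squares(best, measurements, moduli)
      < weighted_sum_of_squares(best + 0.01, measurements, moduli))
```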

When we have grasped that initial principle, the rest of
the investigation is only a matter of the differential calculus;
there is nothing special about it. We have to write down all
the equations that connect the quantities we are measuring,
and then by the ordinary processes of the differential calculus
express the conditions that the sum of the squares of the
errors shall be a minimum, and these will give enough
equations to solve for all our unknowns. I will illustrate that
algebraically by a particular case. Take the case with which
we have already dealt, namely, that in which we had the ages
of the wives in Yorkshire. There we obtained a somewhat
irregular curve representing the numbers at different ages,
and we smoothed that curve by putting parabolic curves of
the fourth degree through various points; and it will be
remembered that we had to change the constants in our
equation according to the particular group of five points
selected. Now let us assume that we have a parabolic
equation of the third degree in this form,

$$y = a_0 + a_1 x + a_2 x^2 + a_3 x^3.$$

This equation has four unknowns; we can therefore make it
pass through any four assigned points, but we cannot make it
pass through five assigned points. Suppose that we wish to
determine an equation of the third degree which will pass near
the five points, then we will apply the method of least squares to
that problem. Let the co-ordinates of the actual observations
be $(x_1, m_1)$, $(x_2, m_2)$, and so on. Let the corresponding points
which we are to find on this particular curve be $(x_1, y_1)$, $(x_2, y_2)$,
and so on. The point $(x_1, y_1)$ will be near, but probably not
coincident with, the point $(x_1, m_1)$. The difference between
$m_1$, the observation, and $y_1$, which would be given by the
curve which we have not yet determined, is the error of the
observation. We are to determine the constants so that the
sum of the squares of those errors shall be least. Writing
that a little more fully, and substituting for y in terms of
x, we have that

$$\sum (m_1 - a_0 - a_1 x_1 - a_2 x_1^2 - a_3 x_1^3)^2$$

is to be a minimum.

In that expression the variables are the four a's, which
have to be determined so as to make the expression a
minimum. Therefore we must differentiate that expression,
when it is written out, with respect to $a_0$, $a_1$, $a_2$ and $a_3$, and
equate these partial differential coefficients to zero, obtaining
as many equations as we have unknowns. Then we have to
solve the equations so obtained.

After a little simplification the following equations are
obtained:

$$\begin{aligned}
5\,a_0 + \sum x_1 \cdot a_1 + \sum x_1^2 \cdot a_2 + \sum x_1^3 \cdot a_3 - \sum m &= 0 \\
\sum x_1 \cdot a_0 + \sum x_1^2 \cdot a_1 + \sum x_1^3 \cdot a_2 + \sum x_1^4 \cdot a_3 - \sum m x &= 0 \\
\sum x_1^2 \cdot a_0 + \sum x_1^3 \cdot a_1 + \sum x_1^4 \cdot a_2 + \sum x_1^5 \cdot a_3 - \sum m x^2 &= 0 \\
\sum x_1^3 \cdot a_0 + \sum x_1^4 \cdot a_1 + \sum x_1^5 \cdot a_2 + \sum x_1^6 \cdot a_3 - \sum m x^3 &= 0.
\end{aligned}$$
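
The formation and solution of these four equations is mechanical, and may be sketched as follows. The sums of powers of x form the coefficients, and the sums of m multiplied by powers of x form the remaining terms; the equations are then solved by ordinary elimination. The function names and the five points are hypothetical, and elimination with pivoting is simply one convenient way of solving, not a method taken from the lecture.

```python
def solve_linear(a, b):
    """Solve a small system a . x = b by elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def fit_cubic(xs, ms):
    """Return [a0, a1, a2, a3] making the sum of squared errors a minimum."""
    # Coefficient of a_k in equation i is the sum of x^(i + k);
    # for i = k = 0 this is simply the number of points.
    matrix = [[sum(x ** (i + k) for x in xs) for k in range(4)] for i in range(4)]
    rhs = [sum(m * x**i for x, m in zip(xs, ms)) for i in range(4)]
    return solve_linear(matrix, rhs)


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]    # origin chosen at the middle point
ms = [3.1, 1.8, 1.0, 2.3, 5.2]      # hypothetical observations
print(fit_cubic(xs, ms))
```

With the origin placed at the middle point, as here, the sums of odd powers of x vanish, which is one instance of the symmetry that simplifies the work.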

The chief thing I want to say about these equations is,
that they are so complicated, and a solution is so laborious,
that they must be put out of court for all ordinary
calculations. If you wish to construct a new table which will
be of some general use, it may be worth while to go through
the solution, but not for any single practical piece of work.
Every one of those separate terms, $\sum x_1$, &c., has to be
calculated arithmetically, and the equations have to be solved.
Even in this simple case we have four equations each containing
four functions. In Merriman's Method of Least Squares,
the simplest methods for that evaluation are given. Many
terms drop out, and the evaluation is possible; and in some
cases we can so choose our origin and take advantage of
certain points of symmetry in the equations, that the work
can be simplified. In this particular case a simple solution
has been given by Professor Darwin.*

Fitting Foemulji to Obseevations. 

Before we look for another way, let us consider again 
whether the assumptions on which the above method depend 
are justifiable, or will justify the great effort which would he 

* See Darwin, “On EalliMe Measures,” London, JSdin, and BuUin JPML 
Mag., July 1877 ; used in JElements of Statistics, pp. 256. 257- 
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!ieces:^ai*y to >olvo the er|iiatioiiS. I think it moll fioiiiil 
that in general they do nor. If we look back ilirontrli ilic 
it will be seen that the urigiiial a>-iiiiipiio]i i- tliat 
the diifereiiee between the actual ii umber ijt peioi.ui.- 
obserredj and y, the number obtained fruin the eqiiatiiui, 
belongs to the normal curve of frequency; and su in every 
case wdiere the method of least squares applies we have an 
observed measiiremeiir, and w’e obtain a theuretieal measure- 
mentj and we assume that the difference between the two 
belongs to a iioniial curve of frequency. Before we can 
make that assumption we must verify that the conditions, under 
w'liicli the normal curve of frequency is obtained, are satisfied. 
We are not in a position to do that, if ive depend only on the 
algebraic proof given above, without investigating the 
deductions of the equation of the curve of error resting on 
other hypotheses. But to my mind there is no proof yet 
given ivhich does show that the normal curve of error will be 
obeyed in the circumstances I have just mentioned ; and 
Professor Karl Pearson has shown that in very many instances 
the normal curve is not obeyed. So the theory is at any rate 
difficult to establish a priori , and is not supported by universal 
experience. I think, -with all the deference that is due to 
Professor Karl Pearson, that the matter yet wants more 
practical experience before it can be fully decided. It would 
be unsafe in the present state of the argument on the one 
hand to say that the normal curve of frequency may be 
expected; or on the other hand to say definitely that it is 
not to be expected, because it has not been universally 
found. That is too difficult to deal with at all thoroughly 
here. The reason I have gone so far into it is this: if the 
method of least squares is very difficult to apply, and if it is 
neither supported sufficiently by theory nor by experiment, 
then it seems expedient to try some other method. A purely 
empirical method would be this : Instead of making the sum 
of the squares of the deviations a minimum, make the sum of 
the first powers of the deviations, all reckoned as positive, a 
minimum, that is to say, remove the square outside the 
bracket in the expression on p. 48. But it is not at all easy 
to make that sum a minimum, because all the terms have to 
be taken as positive, and we do not know until we have finished 
our work which terms are naturally positive or which terms 
are negative. Professor Edgeworth has given a method of
getting the solution when there are only two unknowns.*
When there are three unknowns I believe there is as yet no
practical solution.
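
The difficulty of not knowing the signs beforehand can be seen in a small numerical sketch. This is not Professor Edgeworth's procedure; it is a brute-force illustration for a straight line y = a + bx with two unknowns. It relies on the known property that a line making the sum of the absolute deviations a minimum can always be found among the lines drawn through two of the observed points. The function name `lad_line` and the figures are hypothetical.

```python
from itertools import combinations


def lad_line(points):
    """Search all lines through two observed points and return (a, b, total)
    for the line y = a + b*x whose summed absolute deviations are least."""
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # a vertical line cannot be written as y = a + b*x
        b = (y2 - y1) / (x2 - x1)
        a = y1 - b * x1
        # Every deviation is reckoned as positive, whatever its natural sign.
        total = sum(abs(y - (a + b * x)) for x, y in points)
        if best is None or total < best[2]:
            best = (a, b, total)
    return best


# Hypothetical observations: four lie on y = 1 + 2x, the fifth is discordant.
points = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 100)]
a, b, total = lad_line(points)
print(a, b, total)
```

The search grows rapidly with the number of unknowns, which agrees with the remark that no practical solution was then known for three.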

Another method, still taking the method of least squares 
as the basis, but avoiding the very complex solution, is to 
choose the coefficients, so that the curve will pass through 
exactly the four points assigned ; and then re-calculate them, 
so that the curve shall exactly pass through four other 
assigned points ; and so continually calculate again and again 
the coefficients, getting a series of curves. Then from the 
various values of the coefficients so found, choose those 
coefficients which appear to give the best results. It is really 
a makeshift method. I think it has been often employed, 
and the results have been very satisfactory. If, by one method 
or another, you get coefficients which make the theoretical
curve pass near the original curve, it does not matter by 
what process you have got them. Such a method as that, I 
think, is in general use for approximating to the population in 
inter-censal years. I think the Census Office has never
published this method ; but as far as I can find out, the 
method employed is as follows : Supposing certain points 
represent the population at the various dates at which it is 
exactly enumerated, then if, as a first hypothesis, we assume that
the population increases in geometric progression between two
enumerations, we obtain a simple curve passing from one point 
to the next. Then assume again that from this Census to the 
next there is another increase in geometric progression, and we 
find that the two curves never have exactly the same constants. 
Then obtain some method for passing from one curve to the 
other without a sudden break of curvature, reject the parts of
the curves near the Census years, and replace them by a curve 
which gradually passes from one to the other. That is a purely 
empirical method, and I think it is the one adopted. It is in 
some such way as this that we can go to work if the method of
least squares is too complicated. 
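
The first step of the inter-censal method just described, the assumption of geometric progression between two enumerations, can be sketched as follows. The census figures are hypothetical, the function names are illustrative only, and the smoothing of the junction between two successive curves is not attempted here.

```python
def annual_ratio(pop_start, pop_end, years):
    """Constant yearly ratio implied by geometric growth between two censuses."""
    return (pop_end / pop_start) ** (1.0 / years)


def interpolate(pop_start, pop_end, years, t):
    """Estimated population t years after the first of the two enumerations."""
    return pop_start * annual_ratio(pop_start, pop_end, years) ** t


# Hypothetical enumerations ten years apart (not real census figures).
census = {1881: 1_000_000, 1891: 1_210_000, 1901: 1_400_000}

ratio_first = annual_ratio(census[1881], census[1891], 10)
ratio_second = annual_ratio(census[1891], census[1901], 10)
estimate_1886 = interpolate(census[1881], census[1891], 10, 5)

print(ratio_first, ratio_second)  # the two decades give different constants
print(round(estimate_1886))
```

The two ratios differ, which is the break between the curves that the empirical method has then to smooth over near the Census years.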

* See Edgeworth, "A New Method of Reducing Observations," Phil.
Mag., 1888; and in Journal of the Royal Statistical Society, June 1902, p. 341.

The third method, to which I wish to call attention very
particularly, proceeds in quite a different way. We tabulate
our observations as before, and write down the equation of a
curve which is assumed to fit them, with unknown constants;
calculate from the observations the moments, first, second,
third, fourth (as many as there are unknowns), about the
centre of gravity, by the method used above, and calculate
the moments from the assumed curve in terms of the
unknowns. Equating the moments found from the observations
with the moments found for the assumed curve, we have
equations determining the constants. For example we may
take the instance already discussed, when we found a skew
curve of error to fit certain observations. The general
equation to the skew curve of error being given, by the help
of the integral calculus we stated the values of the first,
second, and third moments in terms of c and j; we equated
these to the moments calculated from the observations, and
thus found c and j. We need to calculate as many moments
as there are unknowns in the particular equation selected.
For instance, in Makeham’s formula there are four unknowns, 
and we have to take four moments. In the normal curve of 
error there are two unknowns, its centre and the modulus ; 
two moments are therefore sufficient to find the normal curve 
of error by this test. In the skew curve of error, the quantity 
j has to be determined in addition. In the empirical equations 
given by Professor Karl Pearson in his well-known paper on 
the measurement of skew groups, which was published in 1895 
in the Proceedings of the Royal Society, there are four 
unknowns, and therefore in general he needed four moments.
In the parabolic interpolations, such as I have used in these 
lectures, there are as many unknowns as we like to take. If 
we stop at $x^3$ we need four moments. In Professor Pareto’s
empirical equation for the grouping of the incomes of the 
people of a country there are two unknowns. The
equation is as follows: $y = \dfrac{A}{x^{a}}$, where y is the number
of persons in receipt of income x, and A, a are constant.
It is also given in a developed form with one more 
constant. It is supposed that the index a is nearly the same 
for all countries, while A varies from country to country. 
You could obtain those values by the principle of least 
squares, or by equating moments. This is not the place to 
criticise the equation : I only give it as an example of 
algebraic equation for statistical grouping. We see then how 
to obtain sufficient equations for the unknown constants, 
and so we come naturally to the question of what is the 
justification for this method. I think I must refer you, in
general, to Professor Karl Pearson's paper for the justifications,
because it is his method, and in particular he has quite recently
published a paper in the journal Biometrika,* going very
carefully into this whole method; and all I can do is to simply
follow in his steps. The method depends on a purely empirical
basis, not on any a priori theory. By its means we do, as a
matter of fact, obtain an equation which fits the observations.
But, incidentally, Professor Karl Pearson shows that the results 
obtained are, in general, the same as those obtained by the 
method of least squares. Without basing his system upon 
the coincidence at all, he does obtain the same results. The 
advantage of the method is, as he has also shown in the 
same paper, that the solution of the equations obtained is 
very much easier than the solution of equations obtained by 
the ordinary method of least squares. I hesitate to go further 
into this subject because it is Professor Karl Pearson's subject, 
and all his papers are very easily accessible. He has shown 
that empirical algebraic formulae can be found for a very 
wide range of groups, and in every case he has fitted equations 
to the groups by the help of this number of moments. He has 
then found that the equations so obtained do fit the groups 
exceedingly well. Groups may, perhaps, contain 30, or 40, 
or 100 measurements, but the constants at disposal are only 
4. If you calculate these 4 constants by any method and 
obtain, as a result, the equations which fit a wide range of
observations, you have a strong empirical justification for the 
method. I believe that is the justification which Professor 
Karl Pearson gives for the method. But we are met face to 
face with this difficult question, which it is impossible to deal 
with here and now : How far ought we in such investigations 
to take empirical formulae which are only justified by their 
results, and how far should we base our reasoning on a priori 
assumptions as to the nature of error, and as to its occurrence, 
assumptions which underlie the theory of probability, and
from such assumptions obtain our equations? Should we 
obtain our equations with the view to fitting the result, or 
should we obtain our equations from a priori reasoning and 
see how far they fit the results? To my mind we have not
nearly enough experience in the matter at present. We have
not sufficiently tested the fitting of groups to the a priori
equations, nor have we yet sufficient experience to say that
the empirical method is universally satisfactory because it has
been found to fit wide ranges of groups. At that point I
leave the discussion.

* Biometrika, April and 1902.
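
The equating of moments described above may be illustrated for the simplest case, the normal curve of error with its two unknowns, the centre and the modulus. The following is a minimal sketch under stated assumptions: the modulus is taken as the square root of twice the second moment about the centre of gravity, which is the convention these lectures appear to use, and the function name and the grouped figures are hypothetical.

```python
import math


def fit_normal_by_moments(values, frequencies):
    """Return (centre, modulus) of the normal curve whose first two moments
    equal those of the grouped observations."""
    total = sum(frequencies)
    # First moment: the centre of gravity of the observations.
    centre = sum(v * f for v, f in zip(values, frequencies)) / total
    # Second moment about the centre of gravity.
    second = sum(f * (v - centre) ** 2 for v, f in zip(values, frequencies)) / total
    # Assumed convention: modulus squared = twice the second moment.
    modulus = math.sqrt(2.0 * second)
    return centre, modulus


# Hypothetical grouped observations (value, number of times observed).
values = [1, 2, 3, 4, 5, 6, 7]
frequencies = [2, 9, 24, 30, 24, 9, 2]
centre, modulus = fit_normal_by_moments(values, frequencies)
print(centre, modulus)
```

A curve with more constants would need correspondingly more moments, each equated in the same way.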

Uses of the Curve of Error.

Whatever may be the ultimate decision in the questions
which I have thus stated, there are certainly many uses for
the curve of error in the form in which I gave it in the last
lecture, quite independently of the discussion we have just
been engaged in. In what I have been recently saying I have
been following, as far as possible, Professor Karl Pearson’s
method. In what I shall say now I am following Professor
Edgeworth’s work. I do not mean that the two are
contradictory in any way; I wish to indicate that I am
trying to summarize the present position of this question on
the lines of the two most eminent authorities in this particular
work. For clearness, I repeat the method of generating the
curve of error given on p. 43. Suppose we have a number of
frequency-curves, each of small and limited range, that is to
say, of great precision, its modulus being small; let the
moduli of n such curves calculated from the squares of the
deviations be $c_1, c_2, \ldots, c_n$. The curves may be of any shape,
except that no finite part of their areas may be at a great
distance from their centres of gravity. Suppose we take $a_1$
observations belonging to the first curve, $a_2$ out of the
second, and so on, and add them together; the curve of
frequency for the resulting sum is the normal curve of error
with modulus $\sqrt{\Sigma a c^{2}}$. If instead of taking the sum, we
take any other function to which the sum is the first
approximation, the curve of frequency for the values of
this function is likely to approximate to a normal curve of
error; but we will here limit ourselves to the sum. The
following diagram and the experiment on which it depends
illustrate this theory. I took Chambers’ mathematical tables,
and chose three digits at random and took their average, and
repeated this a thousand times. The curve of frequency of
the 10 natural digits is a straight line; you are as likely to
get any one of them as any other, if you select a suitable part
of the tables. I have represented that curve of frequency
by ten dots. It is limited at both ends, its modulus is fairly
big, viz.: 4.06, and it supplies a very severe test of the
principle I have enunciated, because we have a curve of
frequency which is absolutely different from the normal
curve of error; it does not approximate to it in any way
whatever. The actual probabilities of the occurrence of
various numbers are the successive coefficients in the
expansion of $(1 + x + x^{2} + \ldots + x^{9})^{3}$. Comparing these
with the result of the experiment we have the following
table:


Average of            No. of times this average
3 digits taken     Was actually     Might be expected
at random             found         every 1,000 times

0                        0                  1
1/3                      4                  3
2/3                     11                  6
1                       10                 10
1 1/3                   14                 15
1 2/3                   17                 21
2                       26                 28
2 1/3                   33                 36
2 2/3                   53                 45
3                       48                 55
3 1/3                   60                 63
3 2/3                   82                 69
4                       76                 73
4 1/3                   72                 75
4 2/3                   73                 75
5                       75                 73
5 1/3                   61                 69
5 2/3                   65                 63
6                       60                 55
6 1/3                   35                 45
6 2/3                   35                 36
7                       29                 28
7 1/3                   30                 21
7 2/3                   15                 15
8                        3                 10
8 1/3                    6                  6
8 2/3                    7                  3
9                        0                  1

It is not my point here to show that those figures are what 
you would expect to get ; what I wish to show is, first, that 
the successive probabilities, when they are plotted out, 
resemble the curve of error ; and, secondly, that the experiment 
tends to fit a normal curve of error. In Diagram IX the
continuous line with dots on it is the frequency which you
would expect. The broken line is the curve of error, with the
same modulus, and the crosses are the positions
obtained from the actual experiment. It is seen that
though we started with a frequency curve which was a straight
line, the theoretical curve which we obtained for the
average of only three terms selected from it is already so much
like a curve of error that you would mistake it for one, if a
model was not traced on the paper; and that the actual
experiment supports the same view.

We note that the modulus calculated from the squared
deviations for the natural digits is 4.06, and that from the
formula given above the modulus for the sum of
three digits should be $\sqrt{3 \times (4.06)^{2}} = 7.032$, and for the average
of three digits should therefore be 2.344. The modulus of
the curve given by the calculated probabilities of the various
numbers is 2.345, while that calculated from the results of the
experiment is 2.358. The averages are 4.5 (theoretical) and
4.494 (experimental).*
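
The theoretical figures quoted for the three-digit experiment can be checked by direct enumeration. The sketch below expands $(1 + x + \ldots + x^{9})^{3}$ and computes the moduli, taking the modulus as the square root of twice the mean squared deviation (the convention these lectures appear to use). The function names are illustrative, and the thousand random drawings themselves are not repeated.

```python
import math


def sum_distribution(num_digits):
    """Coefficients of (1 + x + ... + x^9)**num_digits, i.e. the number of
    ways in which num_digits digits can add up to each possible sum."""
    coeffs = [1]
    for _ in range(num_digits):
        new = [0] * (len(coeffs) + 9)
        for i, c in enumerate(coeffs):
            for digit in range(10):
                new[i + digit] += c
        coeffs = new
    return coeffs


def modulus(values, weights):
    """Square root of twice the weighted mean squared deviation from the mean."""
    total = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return math.sqrt(2.0 * variance)


ways = sum_distribution(3)                       # sums 0, 1, ..., 27
averages = [s / 3 for s in range(len(ways))]     # averages 0, 1/3, ..., 9

digit_modulus = modulus(list(range(10)), [1] * 10)
average_modulus = modulus(averages, ways)

print(ways)
print(round(digit_modulus, 2), round(average_modulus, 3))
```

The list `ways` reproduces the column of numbers expected in every 1,000 trials, and the modulus of the averages equals that of the single digits divided by the square root of three.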

Construction of a Group from Samples.

The theory which I have just enunciated, for the proof of 
which see the reference given on page 44, is, that if we start 
with any frequency-curves, and take our examples from them, 
one from each or many from one, and take the average, we 
shall obtain a curve which becomes more and more like the 
curve of error as we extend the number of our examples, 
and as the frequency-curves satisfy more and more nearly the 
limited conditions which are laid down for them. Now, that 
is not only a mathematical theory : it has very great practical 
importance. Supposing that we take a number of samples 
out of a large group, how near the true average may we 
expect to get ? If the curve of frequency of the group was 
a curve of error, we can at once write down the probability of 
different divergencies. If we have a curve of error with 
modulus c, and we select n samples at random from it, and
then take their average, the modulus for their sum is $c\sqrt{n}$ from the
formula already given, and hence that for their average
is $c/\sqrt{n}$. The modulus of the arithmetic average varies inversely
as the square root of the number of items, a very well-known
principle. I wish to show how this theory can be adapted to
curves of frequency other than the normal curve of error.
Suppose the original curve of frequency to be any curve
whatever, a curve of survivors for example, I do not assume
any particular shape to it. Suppose we go through an
experiment, taking, we will say, m examples at random from
it, and repeat the process k times. [In the experiment
just discussed m was only three, and k was 1000.] Though
the original numbers do not obey the normal curve of error,
yet the average of m of them may be expected to, when m is
sufficiently great. Let c be the modulus for the group of
averages of m samples; then $c/\sqrt{k}$ may be expected to be the
modulus for the average of the whole mass of km samples.
Thus, in the above experiment, c was 2.35, k 1000, and
$c/\sqrt{k} = .064$; the known average for all digits, which formed
the original curve of frequency is 4.5, the average for the
3,000 selected, in 1,000 groups of three, was 4.494; the
difference is one-tenth of the modulus just calculated; so small
a difference might be expected once in nine trials.

* See in Jubilee Volume of the Journal of the Royal
Statistical Society, p. 186.

Thus, whether the curve of frequency of the original group 
is the normal curve of error or not, the precision of the 
average of a great number of samples is proportional to the 
square root of that number. 

Now let us see how to construct not merely an average,
but a whole group, by the method of samples.

Price            Whole group   Per cent.   Sample    Whole group   Sample
                  (months)                 of 100      per 25      of 25

20/- to 30/-         111          18         19    }
30/- „  40/-         134          21         16    }     10           8
40/- „  50/-         206          32         30    }
50/- „  60/-         105          17         18    }     12          13
60/- „  70/-          47           7         11    }
70/- „  80/-          26           4          4    }      3           3
Above 80/-             4           1          2           0           1

                     636         100        100          25          25

Average              43/9                   45/4                    46/6
Diagram X.

The table and diagram give the result of an experiment in
such construction. The material of the experiment is of no
importance here; I merely took the most accessible figures to
conduct the experiment, namely, the official Gazette prices of
wheat for the 636 months for which they are recorded in the
statistical abstracts, and regarded that as a group of things
which I was going to build up by sample. For complete
illustration I had to take a group I knew, and then to take
samples of it. In general, of course, the group is not known,
but has to be constructed from the samples. The actual
group is that given in Diagram X in the continuous lines. To
obtain the samples, I took Chambers' mathematical tables,
and assigned to particular numbers, from 001 to 636,
certain months, and took 100 numbers of three digits at
random. Next, I wrote down the prices in the 100 months
corresponding to those 100 numbers, and grouping them in
10s. groups, obtained the numbers given in the third column
above, and also given by the crosses in the Diagram X (a).
I next selected 25 samples by taking the first 25 of the 100,
and I grouped the figures in 20s. groups, and obtained the
numbers given in the fifth column and by the crosses in
Diagram X (b). What rule have we for deciding how near
the true group the sample is? In the third division, for
instance, between 30s. and 40s. in the whole group, there are
134 instances, and 21 per cent. of the area is between 30s.
and 40s. If we take 100 things at random out of the whole
group, how many of that 21 per cent. are we likely to get?
This is a simple problem in probability: if n samples are
taken, the chances that 0, 1, 2, ..., n will come from a given
part, which is to the whole as is p to 1, are the successive
terms of the expansion of $(q + p)^{n}$, where $q = 1 - p$; as
n increases we approximate to a curve of frequency with
modulus $\sqrt{2pqn}$ (see p. 34). In the third division p = .21, while
n, the whole number of samples in the first experiment, is 100.
Here $\sqrt{2pqn} = \sqrt{2 \times .21 \times .79 \times 100} = 5.8$. The difference
between the actual number per 100 in the group, namely, 21,
and the number found in the sample, namely, 16, is less than
the modulus. In all the other cases in both experiments the
differences are within the "probable error" (which is .47 of
the modulus, see p. 36). We have thus found a criterion of
the divergencies to be expected between the distribution of
magnitudes in a group of samples and the distribution in the
unknown group from which they arise.
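
The criterion just stated can be worked through for the division between 30s. and 40s. The sketch below uses only the figures quoted in the text (p = .21, n = 100, 21 expected, 16 found) and the factor .47 given for the probable error; the function name is illustrative.

```python
import math


def binomial_modulus(p, n):
    """Modulus sqrt(2*p*q*n) of the number of samples falling in a part
    that is to the whole as p is to 1, when n samples are taken."""
    q = 1.0 - p
    return math.sqrt(2.0 * p * q * n)


p = 0.21          # share of the whole group lying between 30s. and 40s.
n = 100           # number of samples in the first experiment
expected = p * n  # 21 per 100 in the whole group
found = 16        # number found in the sample of 100

mod = binomial_modulus(p, n)
probable_error = 0.47 * mod
difference = abs(expected - found)

print(round(mod, 1), round(probable_error, 1), difference)
print(difference < mod)             # within the modulus
print(difference < probable_error)  # but outside the probable error
```

The difference of 5 is less than the modulus but greater than the probable error, which is why this division is singled out from the others.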

As regards the precision of the averages of the samples, the
modulus for the original group is about 19s., and, therefore, the
moduli for the averages of 100 and of 25 samples, respectively,
are $19s. \div \sqrt{100} = 1s.\ 11d.$, and $19s. \div \sqrt{25} = 3s.\ 10d.$ The
averages found from the samples are actually 45s. 4d. and
46s. 6d. which are, respectively, 1s. 7d. and 2s. 9d. in excess
of the average of the whole group.

The experiment, therefore, forms a good illustration of the
theory, and on consideration it will, I think, be found that
the theory is in strict accordance with common-sense and
common experience.



MEASUREMENT OF GROUPS.


FIFTH LECTURE. 


Correlation between Two Groups. 


Let there be n pairs of measurements $(x_1, y_1)$, $(x_2, y_2)$, and so
on up to $(x_n, y_n)$, the members of each pair having some
determinate connection with each other; for example, suppose
that the x’s are the ages of the wives in the group taken
above, and the y’s the ages of their husbands, $x_r$ and $y_r$ being
the ages of a married couple. This is the example discussed
below. Or suppose that $x_r$ is the age at which a man dies,
and $y_r$ the age at which his father died; or suppose that $x_r$, $y_r$
are measurements of physical characteristics of the same man.
Or again, $x_r$ might be a death rate, in a year in which $y_r$ was
the average temperature. It is required to measure the
relationship between x’s and y’s so as to answer this question:
Given one of the x’s, assign the probable value of the
corresponding y. For example, given the age at which a man
died, assign the most probable age to which his son will live.
Or, taking one member of the group of wives at random, state
the probabilities of the age of her husband. We have in fact
to give numerical expression to such statements as these:
A high death rate goes with a low temperature; a long-lived
father has long-lived sons; for two statements where
two measurable quantities are connected in that way, where
in common parlance we connect them with simple adjectives,
we have to find a numerical or mathematical expression for
the relationship. First suppose that there is no causal
connection between the groups. Then if we select any
particular place on the axis on which the x’s are measured,
and mark in the corresponding y’s, we shall get a group of y’s
whose average is equally likely to be above or below the
average of all the y’s. Suppose we choose a group of wives,
those between 25 and 30, mark in the ages of their
husbands, and mark the average of such ages, if there is no
connection between the ages of the one group and the ages of
the other, the average of the group so taken will be near or
equal to the average of the whole group of husbands, namely,
42 years. And so, if we take another period and mark in the
various ages of the husbands we should again find the average
near the average of the whole group. If the x’s are
represented on a horizontal axis, and the y’s are measured
vertically by points placed above the values of x which are
their pairs, then if there is no causal connection between the
magnitude of the x’s and of the y’s, the averages of groups of
the y’s corresponding to assigned intervals on the axis of x
will all lie near the horizontal line through the averages of
the y’s. They will not lie on it, but the best straight line we
can draw near these points will be a horizontal line through
the average; that is obvious as soon as the statement is
understood.

But now suppose there is a causal connection between the
two sets of measurements; suppose, for example, that a high
value of x goes with a high value of y. Then if we start from
the average value of x, which we may assume for the moment
corresponds to the average value of y, and pass to the right
and choose a group at a place above the average for the x’s,
the y’s which are obtained for that group will be distributed
about an average above the line. And as we continually
mark off the averages for group after group by points, they
will lie on some curve which tends upward to the right from
the origin and downwards to the left. (See for example
Diagram XI.) If, on the other hand, a high value of x went
with a low value of y, there is a change of sign; the series of
averages would go down to the right and up to the left. The
exact method of drawing a line through these points I do not
propose to discuss very minutely. We could draw a smooth
line by the methods discussed in the first lecture, or a
freehand curve. We can either draw a straight line as near
as possible to the dots, or we can draw a curve. I shall
not discuss the general shape of that curve: I shall merely
assume that, from the observation or otherwise, we can draw
that curve. And since in any series of observations the
particular averages are liable to slight displacement, in a
finite number of observations we do not get the most probable
point with each average, and must smooth the line in the way
we have discussed. We may assume an equation, $y = f(x)$,
which gives the average of the y’s for the particular values of
x; that is only giving a general form to the statement, that a
value of y is connected with a value of x by a determinate
equation.
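
The marking off of averages for group after group can be sketched in a few lines. The pairs below are hypothetical ages of wives (x) and of their husbands (y), not the figures of the table discussed later, and the function name and the five-year interval are assumptions made for illustration.

```python
from collections import defaultdict


def averages_by_interval(pairs, width):
    """Group the pairs by the interval of x in which they fall and return
    the average of the corresponding y's for each interval, keyed by the
    lower limit of the interval."""
    groups = defaultdict(list)
    for x, y in pairs:
        lower = (x // width) * width
        groups[lower].append(y)
    return {lower: sum(ys) / len(ys) for lower, ys in sorted(groups.items())}


# Hypothetical married couples: (age of wife, age of husband).
pairs = [(22, 25), (24, 27), (27, 30), (28, 28), (33, 36),
         (36, 41), (38, 39), (43, 44), (47, 52)]

curve_of_averages = averages_by_interval(pairs, 5)
overall_average = sum(y for _, y in pairs) / len(pairs)

for lower, average in curve_of_averages.items():
    print(lower, lower + 5, round(average, 1))
print(round(overall_average, 1))
```

If the averages rise steadily with x, as they do here, the points tend upward to the right; with no connection they would scatter about the overall average of the y's.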

This equation, of course, only gives the position of the
averages of the selected groups of y’s. Every one of these
groups has its own frequency-curve. If we select again the
ages of the husbands of those wives whose ages are between
25 and 30, we can draw a frequency-curve for that group of
husbands, but the centre of that frequency-curve will no
longer be at the average age of all the husbands, if there is
causal connection between the groups; but as the group taken
is below the average of the wives, the centre of this curve
will be below the average for the husbands. It is not
necessary, in general, to make any attempt to draw this
frequency-curve point by point, but only to take its centre
and in some cases its modulus. Instead of dealing with
arithmetic averages, we may equally well use the medians of
the groups.

We might take, for example, such a question as this, a very 
old question: Has the price of wheat anything to do with the 
marriage rate ? In such a case as that we plot out the prices 
of wheat in different months or years along the axis of x,
and put in ordinates showing the average marriage rate when
the wheat was that particular price, and the direction of this line 
or the form of this curve would give, within certain limits 
dealt with below, the answer to this question, whether there 
was a connection between the two or not. If we do obtain 
from our observations that there is a tendency upwards to 
the right and downwards to the left, or vice versa, we have
found that there is something common in the system of 
causation which produces the two sets of phenomena. We 
cannot say that the x’s are the cause of the y's, nor vice 
versa, but only that the two phenomena are not absolutely 
independent. 



The Coefficient of Correlation.

We have to find a numerical measure of that dependence.
If the curve that we obtain is a straight line, we have only to
find a means of calculating its inclination. Before proceeding
to this let us spend a few words on the case when the curve
is not a straight line. Suppose that we have sufficient
observations to determine by experiment and observation the
actual shape of this curve from large groups, we can,
without applying any further theory whatever, establish the
connection between the x’s and the y’s; the curve can be
plotted out, and given algebraic expression, if possible; and
then we should be able to say that for a particular value of x
the most probable value of y was the one obtained from the
curve. We could have a curve simply from experience, and
use the experience with similar phenomena at another time.
For instance, if we had that experience of the length of the
lives of the children of parents who lived to various ages, we
should be able from this empirical curve, to say if a
father lived to a certain age then the chances of the life of
the son are given by a frequency-curve whose centre is
found from the empirical diagram, and whose shape would
very likely be known also. In many cases, however, the
curve of averages is approximately a straight line. Even if
the approximation is not very exact, it may be useful to
calculate the inclination of the straight line that lies
nearest the averages. Let us suppose that we have the
equation of this line, $y = ax + b$. Consider any observation $x_r$,
$y_r$; if this observation lay exactly on that line, $y_r$ would equal
$ax_r + b$. If the observation does not lie on the line, its
distance from it, measured parallel to the axis of y, is
$y_r - (ax_r + b)$. To obtain the best values for a and b, which
are the only unknown quantities, we can proceed* by the
method of least squares, and make the sum of the squares
of such quantities as $y_r - (ax_r + b)$ a minimum. The
differentials of $\Sigma (y_r - ax_r - b)^{2} = u$ (say) with regard to both
a and b must be zero.

Thus $\dfrac{du}{da} = 2a\Sigma x^{2} - 2\Sigma xy + 2b\Sigma x = 0$,

$\dfrac{du}{db} = 2nb + 2a\Sigma x - 2\Sigma y = 0$.


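* See below p. 73.

Choose the axes so that both the x’s and the y’s are
measured from their averages, then $\Sigma x = 0$ and $\Sigma y = 0$, and the
equations give us $b = 0$ and $a = \dfrac{\Sigma xy}{\Sigma x^{2}}$; the line required passes
through the origin, and its equation is $y = \dfrac{\Sigma xy}{\Sigma x^{2}}\,x$. Let $\sigma_1$, $\sigma_2$
be the standard deviations of the groups of x’s and of y’s, so
that $n\sigma_1^{2} = \Sigma x^{2}$, $n\sigma_2^{2} = \Sigma y^{2}$, and let $r = \dfrac{\Sigma xy}{n\sigma_1\sigma_2}$; then the above
equation becomes

$y = \dfrac{r\,n\,\sigma_1\sigma_2}{n\sigma_1^{2}}\,x = r\,\dfrac{\sigma_2}{\sigma_1}\,x$,

that is, $\dfrac{y}{\sigma_2} = r \cdot \dfrac{x}{\sigma_1}$.*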
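
The algebra above can be verified numerically. The sketch below measures both series from their averages, computes the least-squares inclination $\Sigma xy / \Sigma x^{2}$ and the quantity r, and confirms that the inclination equals $r\,\sigma_2/\sigma_1$. The function name and the five pairs of figures are hypothetical.

```python
import math


def correlation_and_line(xs, ys):
    """Return (r, slope, sigma_1, sigma_2) for paired measurements, with
    deviations reckoned from the averages of the two series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]          # x's measured from their average
    dy = [y - mean_y for y in ys]          # y's measured from their average
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_xx = sum(a * a for a in dx)
    sum_yy = sum(b * b for b in dy)
    sigma_1 = math.sqrt(sum_xx / n)        # n * sigma_1^2 = sum of x^2
    sigma_2 = math.sqrt(sum_yy / n)        # n * sigma_2^2 = sum of y^2
    r = sum_xy / (n * sigma_1 * sigma_2)
    slope = sum_xy / sum_xx                # least-squares value of a
    return r, slope, sigma_1, sigma_2


# Hypothetical paired measurements.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r, slope, sigma_1, sigma_2 = correlation_and_line(xs, ys)

print(round(r, 4), round(slope, 4))
print(abs(slope - r * sigma_2 / sigma_1) < 1e-12)
```

Interchanging the two series leaves r unchanged but alters the inclination, which is the sense in which dividing by the standard deviations makes the measure symmetrical.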

In order to make r symmetrical, it has been necessary to
divide by $\sigma_1$ and $\sigma_2$, that is, to measure x and y by their
standard deviations. It is a very natural thing to do. Before
we can get any numerical comparison, we must reduce them
to some common measure, and a common unit which we can
very reasonably adopt is the standard deviation for each of the
two things. If we are dealing with the question I suggested
just now (the marriage rate and the price of wheat) we
cannot compare shillings with a rate per thousand, but we can
compare a ratio of the number of shillings to a standard
number of shillings, with the ratio of the rate per thousand to
a standard rate per thousand. We are then comparing
absolute instead of concrete quantities. We should get
similar equations if we used the modulus instead of the
standard deviation, or the probable errors, or the mean
deviations. For rapid work we could replace the $\sigma_1$ and $\sigma_2$
by the probable errors, which are proportional to the standard
deviations in curves which approximate to the curves of error.
It is to be noticed that we can express the quantity r in the
following form: r is the average of such products as
$\dfrac{x}{\sigma_1} \cdot \dfrac{y}{\sigma_2}$. r is called the coefficient of correlation. It is not
difficult to show by pure algebra that the quantity r so
determined must lie between +1 and −1†; and that r equals
+1, only if the ratio of every x to its corresponding y is
identically the same as the ratio of every other x to its
corresponding y, that is, that the ratio y/x is constant, and
equal to $\sigma_2/\sigma_1$. If the ratio is constant and equal to
$-\sigma_2/\sigma_1$, r becomes −1,
and an increase of x corresponds to a diminution in y. r
is always between +1 and −1, and between these limits
there is a scale of correlation. For instance, we can say that
the correlation between two sets of phenomena is .6.
Of course, when one is first introduced to a new scale of this
sort the numbers in the scale convey no meaning; it is a
matter of experience to attach the right value to the various
magnitudes in the scale. Perfect correlation can be understood
from the statement that groups are perfectly correlated if the
deviation of a member of one always equals the deviation from
the average of the corresponding member of the other
multiplied by an assigned constant. If the two groups,
marriage and wheat prices, were perfectly (negatively)
correlated, you would be able to establish some such relation
as this: An increase of .1 in the marriage rate is always
found with a diminution of 6d. in the price of wheat. Of
course, such a rigid relation is never obtained unless there is
some physical cause binding the two things together. As the
ratio of corresponding pairs tends to constancy, the correlation
becomes more and more perfect. That must be regarded as a
definition of correlation.

* The last few paragraphs are substantially the same as those given by
Mr. Yule in the Journal of the Royal Statistical Society, 1897, p. 817 seq.

† See Elements of Statistics, p. 319.

Now consider the sum of the products of x and y;
let us write X for $\dfrac{x}{\sigma_1}$, and Y for $\dfrac{y}{\sigma_2}$.

If there were no correlation, if we selected the values of
Y which corresponded with a particular small range of
values of X, we should be likely to find a negative value to
neutralize each positive value of Y, and the products arising
from that range of X’s would tend to zero, and the greater
the number of terms the less the distance of their average
from zero. But directly there is any bias towards
the positive value of Y for this particular range of X’s, as
we increase the terms we may still get negative terms here
and there, but on the whole we shall get positive terms; and
so on, all the way up the scale of X’s. When there is
correlation it is clear that the sum of the products tends to be
greater than where there is none. Thus it seems clear
from first principles that the quantity r thus calculated will
make a good measure of correlation.

There is an important caution to be given in the use of 
this formula. If, from two series of phenomena which were 
absolutely unconnected, we took a limited number of examples, 
say a thousand, and worked out the value of r, we should not 
obtain exactly zero, or rather the chances are very much
against obtaining exactly zero, even if there was no correlation ; 
and if we took a very small number of examples the chances 
are very much against obtaining anything near zero. As we 
increase the number of samples, if there is no correlation, 
the coefficient will tend more and more nearly to zero. What
we require before we can use the coefficient is some criterion 
to enable us to know whether the formula is significant, 
or whether the actual number might have arisen if there had 
been no correlation whatever. Such a criterion is given below 
on p. 88. 
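The caution can be illustrated numerically. The following is only a sketch in Python, under assumptions that are not in the lectures: the two unconnected series are drawn from a normal pseudo-random generator, and the names `correlation` and `mean_abs_r`, the seed and the number of trials are all hypothetical choices. It computes r as the sum of products of deviations divided by n times the two standard deviations, and averages the size of r over many unconnected samples.

```python
import math
import random


def correlation(xs, ys):
    """r = sum(x*y) / (n * sigma1 * sigma2), x and y measured from their averages."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sigma1 = math.sqrt(sum(d * d for d in dx) / n)
    sigma2 = math.sqrt(sum(d * d for d in dy) / n)
    return sum(a * b for a, b in zip(dx, dy)) / (n * sigma1 * sigma2)


def mean_abs_r(n, trials=200, seed=1903):
    """Average size of r over `trials` pairs of unconnected samples of n items."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        total += abs(correlation(xs, ys))
    return total / trials


for n in (5, 20, 1000):
    print(n, round(mean_abs_r(n), 3))
```

With few examples the unconnected series commonly give a value of r well away from zero; with a thousand the value is small, which is why a criterion of significance is needed before r is interpreted.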
[Table: Husbands and wives classified by age, in years.]

Median age of husbands (for successive five-year age groups of wives):
22.8  25.8  29.3  34.0  38.9  43.9  48.8  53.7  58.6  63.2  68.0  72.5  76.4  80.2

Average age of Husbands, 42.16 years. Standard deviation, 12.6 years.

Average age of Wives, 40.11 years. Standard deviation, 12.1 years.

Coefficient of correlation is .96 approximately.

The numbers given are in every case the nearest thousands.


A numerical example I have prepared will put the 
calculation in a clearer light. The table here given shows 
the numbers (to the nearest thousand) of the wives and 
husbands in the County of York in 1901 at various ages, 
in periods of five years. For example, if you take
the husbands' ages from 40 to 45 and look along the line, you
will see that there are two wives between 25 and 30, [...]
between 30 and 35, 25 between 35 and 40, and so on. It
was not practicable to deal with 610,000 cases, and I have
therefore dealt with the thousands only, and approximated
throughout the calculation. The fact that the numbers run
diagonally down the table as they do shows at once the
correlation. I have taken a case where the correlation is nearly
perfect. If we had a table where the correlation was very small
we should find the numbers distributed in random fashion
over the table. In such a list of figures as this it is not
practicable to take the arithmetic average. It is easier and
as accurate to take the medians. I have approximated to the
medians for all the groups, both horizontal and vertical, by
the methods already explained. To take a particular example,
consider again the husbands who are between 40 and 45 years
of age. If you look along the list you will find there are in
all 78, and that the median age is 40.7 years. Or if you take
a vertical column, if you choose those wives who are between
40 and 45, and look vertically downwards, you will find that
there are in all 77 of them, and that the median age of their
husbands was 43.9.

The diagrams show the medians graphically. In the first,
the ages of wives are measured horizontally, those of husbands
vertically. Above the middle point of each five-year period
is placed a dot indicating the median age of husbands whose
wives' ages come in that period. Thus, looking upwards from
the position of 37½ years, the middle age of the group of wives
between 35 and 40, you will find that the dot indicating the
median age of the husbands is placed at 38.9 years. You will
see that the points so obtained lie very nearly in a straight
line. At the top and at the bottom the line becomes a
little bit curved, for the influences of the lower and upper
limits of ages make themselves felt. If we tried a
normal distribution for the group of wives who are [...]
we should be getting husbands at 13 and 14 years of age, and
at the other end of the scale we should have got husbands
at ages at which there are no people alive. The fact that the
scale is limited at both ends is the cause of the deflection of
that curve from the straight line. We have now to measure
the inclination of that line. In the case I have taken,


Showing correlation between ages of husbands and their wives.

Ages of wives. Ages of husbands.

The tangent of the inclination of the line through the dots is .97 (nearly). The tangent of the inclination of the line through the dots is .92.


where the correlation is so perfect, there is no difficulty in measuring
the line, because if a straight line is drawn through three or
four of those points it passes very near the others. In
other cases, it is not so obvious which straight line is to be
drawn; and then we can proceed by the method of least
squares already taken, or you can proceed by the following
practical method which yields good results: Mark on the diagram
lines horizontally and vertically through the averages of the
two groups, and rotate a ruler through their point of
intersection until the same number of dots is found on the one
side of it as on the other. It will be found that that method
gives a definite position of the line which passes very near the
points; it is a purely empirical way; but as the coefficient
of correlation need generally not be calculated with
great minuteness, it will in general be sufficiently correct.
It is often absurd in cases of probability to work out
the results with very great accuracy. The line is not
drawn in the diagram above, because it would have
obscured the dots; but underneath is given the tangent
of the inclination to the horizontal of the line which would
satisfy the conditions; the tangent of this angle is .97.
The second diagram is constructed in a similar way, for the
median ages of wives, whose husbands are in a given
group; the tangent of the inclination of the line through
the points is now .92. The average age of the husbands is
42.16 years, with standard deviation 12.6 years; the average
age of the wives 40.11 years, with standard deviation
12.1 years. The statement we have now obtained
is of this sort: If we are dealing with a man whose age
is h, in excess of the average, and we wish to find
the age of his wife; the value of w in the equation
h - 42.16 = .92(w - 40.11) is nearly the most probable value of
her age. That comes at once from the geometry of the second
diagram. From the first diagram we obtain similarly:
Given the age of a woman as being w, so that the deviation
from the average is w - 40.11, then the median age of the
husband group is 42.24 + .97(w - 40.11). We shall probably
also need to know the curve of frequency for each of these
groups. Unless there is a reason to the contrary, I think in
general that we may assume that the curve of frequency for a
selected group is similar to the curve of frequency for the
whole group from which it was selected. So that we can
calculate the standard of deviation for this particular curve of
frequency when you know the standard of deviation for the
whole group. That is to say, we can ascertain the chance
that the age of the husband of a particular woman is
any assigned number of years above or below the age here
selected. In the little table given above we can actually find
these small curves of frequency; for instance, in the ages of
wives 30 to 35 years of age the curve of frequency for the
husbands goes as follows: 9, 49, 29, 7, 2, 1.

The above is the graphic way of working out the question.
We have now to show its relation to the formula for r, the
coefficient of correlation. The quantity $b_1$ in the equation
$y = b_1 x$, where $b_1 = r\sigma_2/\sigma_1$, is the quantity evaluated by the diagram as .97,
that is the tangent of the inclination of the line to the axis of x,
where ages of wives and husbands are measured on the axes
of x and y respectively, and $\sigma_1$, $\sigma_2$ are the standard deviations
for wives and husbands. If I had reduced all the measurements
to the standard of deviation beforehand, r, the coefficient
of correlation, would have been the tangent of the inclination
of the line. The question which is the easier to work decides
which of the two methods you adopt. If you work it as
I have done, with the same scale of years vertically and

horizontally, you would have to say that $r = .97 \times \sigma_1/\sigma_2$, and
from the lower diagram that $r = .92 \times \sigma_2/\sigma_1$, whence r is the
geometrical mean between .97 and .92, between the tangents
of the inclinations of the line calculated on the two different
hypotheses, namely, .945; and $\sigma_1/\sigma_2 = \sqrt{.92/.97} = .974$.
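The arithmetic of the two tangents can be checked with a short sketch in Python. The function name `r_and_sigma_ratio` is a hypothetical label, not anything used in the lectures; the two inputs are the tangents read from the diagrams.

```python
import math


def r_and_sigma_ratio(slope_first, slope_second):
    """Return (r, sigma_1 / sigma_2) from the two tangents.

    slope_first  = r * sigma_2 / sigma_1  (husbands' medians against wives' ages)
    slope_second = r * sigma_1 / sigma_2  (wives' medians against husbands' ages)
    """
    r = math.sqrt(slope_first * slope_second)        # geometrical mean
    ratio = math.sqrt(slope_second / slope_first)    # sigma_1 / sigma_2
    return r, ratio


r, ratio = r_and_sigma_ratio(0.97, 0.92)
print(round(r, 3), round(ratio, 3))  # 0.945 0.974
```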


Now let us proceed to calculate r by the formula

$$r = \frac{\sum xy}{n\sigma_1\sigma_2}.$$


It is, of course, a long business, and I shall not give the work 
completely; I shall only indicate the way in which it was done. 
The problem was to find that product for 610,000 pairs, which, 
of course, is a prohibitive piece of work, and cannot be done 
accurately, because the ages are not given except in 5 yearly 
limits. We proceed by approximation. First of all, I neglect 
all the numbers below 1,000; secondly, I assume that the 
numbers left are at the middle of their respective groups. 
Then I deal with the 60 or 70 numbers in the table on p. 67 in 
the following way. Select a group, e.g., wives whose ages are



25 to 30; the middle, 27½, is 12.6 years below the average
of all wives; express this and other deviations in terms of the
standard deviation, 12.12; 12.6 ÷ 12.12 = 1.04. That is the
term to be applied throughout this group of husbands. Go
through a similar process for the husbands. This 6 is at the middle
of the group 20-25 years, namely, 22½, which is 19⅔ years
below the average age of all husbands, and that in terms of
the standard deviations is about 1.6; work out the other
deviations, which are in arithmetic progression, in the
same way. Then multiply the numbers in the group, 6, 49, 31,
7, 2, 1 each by its deviation, add, and multiply the sum by the
deviation 1.04 for the group.

Example:

Age of Wives, 25-30.
Distance from Average, -1.04 of Standard Deviation.

Age of        Distance from     No.     Product
Husbands      Average
20-25         -1.6               6      - 9.6
25-30         -1.2              49      -58.8
30-35         - .8              31      -24.8
35-40         - .4               7      - 2.8
40-45         + .03              2      +  .1
45-50         + .4               1      +  .4
                                Sum     -96.5

Corresponding part of $\sum \frac{x}{\sigma_1}\cdot\frac{y}{\sigma_2}$ is -1.04 of -96.5 = +103.6
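The procedure for one group can be sketched in Python as follows. The names `rows`, `wives_deviation` and `group_contribution` are hypothetical; the figures are the rounded deviations and numbers printed in the example, so the result (about -95.5 for the inner sum and about +99.4 for the contribution) will not reproduce the printed totals to the last figure. The sketch is meant to show the steps rather than to check the table.

```python
# (age group of husbands, deviation in standard-deviation units, number in thousands)
rows = [
    ("20-25", -1.6, 6),
    ("25-30", -1.2, 49),
    ("30-35", -0.8, 31),
    ("35-40", -0.4, 7),
    ("40-45", 0.03, 2),
    ("45-50", 0.4, 1),
]
wives_deviation = -1.04  # wives aged 25-30, in standard-deviation units


def group_contribution(x_deviation, rows):
    """Multiply each number by its deviation, add, then apply the group's own deviation."""
    inner = sum(y_deviation * count for _, y_deviation, count in rows)
    return inner, x_deviation * inner


inner, part = group_contribution(wives_deviation, rows)
print(round(inner, 2), round(part, 2))
```

Repeating this for every group of wives, adding the contributions and dividing by n gives the coefficient.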


Some of the resulting terms will be negative unless the
correlation is considerable. Add these terms, and divide the
sum by n (in this case 602), and the coefficient of correlation
.96 is obtained. In the method I suggest using we do not
deal with any large numbers at all. The number differs little
from the geometric mean of .945 found by the graphic method
above. Also $\sigma_1/\sigma_2 = .96$ (instead of .97 as reckoned above),
$r \times \sigma_1/\sigma_2 = .92$ (the same as above), $r \times \sigma_2/\sigma_1 = 1.0$ (instead of .97).


Justification of the Formula for r.

In the method of finding the formula for r on page [...] I
used the method of least squares without examining its
suitability. I will now give reasons, which have not, so far
as I know, been previously offered, in favour of this method.
If the figures we are dealing with belong to the normal curve
of error, there is no difficulty. If their curve of frequency
has any other form, still the averages of selected groups,
represented by the dots in Diagrams XI and XII, are governed
by normal curves of error (see page 56). Let $y_1, y_2, \ldots$
be the averages of groups of y's, containing respectively
$k_1, k_2, \ldots$ items, whose x values are $x_1, x_2, \ldots$, so that
the $k_r$ x's which, in the grouping of x's adopted, are in a small
group whose centre is $x_r$ have y-pairs whose average is $y_r$.
Let $y = ax + b$ be a line which contains the values of y from
which the observed $y_1$, $y_2$, &c., are deviations. Then
$(y_r - ax_r - b)$ is a quantity whose frequency-curve is

$$\frac{1}{c\sqrt{\pi}}\, e^{-(y_r - ax_r - b)^2/c^2},$$

where c, the modulus, is inversely proportional
to $\sqrt{k_r}$, $k_r$ being the number in the $x_r$, $y_r$ group. The
probability of such deviations occurring together is (as on
page 46) a maximum, when $\sum k_r (y_r - ax_r - b)^2$ is a minimum.
Equating the partial differentials of this sum with reference
to a and b to zero, and remembering that $\sum k_r x_r = 0 = \sum k_r y_r$,
if the deviations are measured from the general average, we
have, as on page 65, $b = 0$, and $\sum k_r (a x_r^2 - x_r y_r) = 0$. Hence,

$$a = \frac{\sum k_r x_r y_r}{\sum k_r x_r^2} = \frac{\sum xy}{n\sigma_1^2},$$

where the summation extends over all the pairs. Then, as before,

$$r = a\,\frac{\sigma_1}{\sigma_2} = \frac{\sum xy}{n\sigma_1\sigma_2}.$$
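The step from the group averages to the sum over all pairs can be checked numerically. The sketch below is in Python, with a small set of invented pairs and hypothetical function names; it fits the line through the group averages, each weighted by the number in its group, and compares the slope with the one obtained from the individual pairs.

```python
import math

# Invented (x, y) pairs; several y values share each x, forming the groups.
pairs = [(1, 2.0), (1, 3.0), (2, 3.5), (2, 4.5), (2, 4.0),
         (3, 6.0), (4, 6.5), (4, 8.0)]


def averages(pairs):
    n = len(pairs)
    return sum(x for x, _ in pairs) / n, sum(y for _, y in pairs) / n


def slope_from_pairs(pairs):
    """a = sum(xy) / sum(x^2), deviations measured from the general averages."""
    mean_x, mean_y = averages(pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx


def slope_from_group_averages(pairs):
    """Least squares on the group averages y_r, each weighted by its number k_r."""
    mean_x, mean_y = averages(pairs)
    groups = {}
    for x, y in pairs:
        groups.setdefault(x, []).append(y)
    numerator = denominator = 0.0
    for x, ys in groups.items():
        k = len(ys)
        y_r = sum(ys) / k - mean_y
        x_r = x - mean_x
        numerator += k * x_r * y_r
        denominator += k * x_r ** 2
    return numerator / denominator


def correlation(pairs):
    mean_x, mean_y = averages(pairs)
    n = len(pairs)
    sigma1 = math.sqrt(sum((x - mean_x) ** 2 for x, _ in pairs) / n)
    sigma2 = math.sqrt(sum((y - mean_y) ** 2 for _, y in pairs) / n)
    return slope_from_pairs(pairs) * sigma1 / sigma2  # r = a * sigma_1 / sigma_2


print(slope_from_pairs(pairs), slope_from_group_averages(pairs), correlation(pairs))
```

The two slopes agree because the sum of xy over the pairs in a group is the group's x multiplied by k times its average y.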


Consideration of the nature of the formula will, I think, lead 
to the conclusion, that the coefficient of correlation calculated 
by the formula is a good measurement of correlation, whatever 
curves of frequency you are dealing with; and it is surprising
how very rapidly a small extent of correlation makes itself 
felt, even when you deal with quite a few examples. If n is 
only 20, you will soon find whether there is correlation or 
not by this formula. If you select groups where there is no 
correlation the criterion, discussed below, shows that the 
correlation is not significant ; but directly there is likely to be 
correlation between the groups, this formula for r shows it. 
The coefficient of correlation can be used then in a very large 
region of cases in which it is required to test the connection 
between two series of phenomena. In particular, it can be 
used to decide whether two series of phenomena are entirely 
unconnected or not, which subject necessitates a preliminary 
treatment of the nature of series. 



MEASUREMENT OF SERIES. 


SIXTH LECTURE. 


Series. 

I PROPOSE to deal in this lecture, first of all, with series 
in general, and then with the comparison of and correlation 
between two series. By a series I understand a list of 
numerical events recorded at regular intervals, for example, 
recorded once every year. In representing a series by a 
diagram we measure time on the horizontal axis, and 
dividing it up into years, we erect an ordinate at the point 
corresponding to each year, to represent on a suitable scale 
the magnitude at that particular year. The question whether 
we should represent these magnitudes by dots or lines or 
rectangles is important, but it is decided on the principles 
discussed when we were dealing with the representations of 
groups, and we need spend no more time on the analysis now. 
Perhaps the most natural way of representing such series is to 
erect a series of rectangles whose areas are proportional to 
the successive magnitudes; but if we leave the diagram in
that form it will not be very clear, it will be very ugly, and
certainly this is not a neat way of finishing the representation.
The next step is to draw a continuous line to replace the
rectangles; the commonest way of doing this is, to mark the
middle points of the tops of the rectangles, and join those
points by straight lines; but this method is erroneous, for the
same reason that it was erroneous in the representation of a
group. We need to draw a continuous line so that the areas
contained in the rectangles in the first place, and by the
curved trapeziums in the second, shall be equal in every case;
but this more correct curve is practically coincident with the
erroneous straight lines; it makes very little difference in 
practice which of the two we draw ; they give certainly the 
same optical impression. If, however, we do replace the 
rectangles by a continuous line we are making an assumption 
which is sometimes justified, but sometimes not; by the fact
of drawing a continuous line we give the impression that the 
event represented is continually taking place. This is correct 
in representations of births, deaths, and marriages, and it is 
partly correct in representing imports and exports by curves 
but it is not correct in the representation of events which
only occur once each year. These are details which are easily 
analysed. 

Classification. 

The series, or the curves which represent them, can be 
divided into three main classes: periodic curves, symptomatic
curves, and others; or instead of others, we may say curves 
with random fluctuations. Periodic curves are those where 
similar fluctuations recur at equal intervals of time, as the 
annual fluctuation of temperature recorded month by month. 
Symptomatic curves are those which have a definite tendency up 
or down, a "symptom," though short periods may obscure it,
as the death rate since 1870. A curve, which is neither
periodic nor symptomatic, may often be regarded as having
random fluctuations about a stationary average, as a curve 
representing the annual averages of any meteorological 
phenomena, such as average temperature year by year. In 
the Diagram XIII* all four curves are symptomatic ; the first 
three are downwards, and the last upwards for the first 30 
years and then nearly level. The series represented in 
Diagram XIV has apparently random fluctuations. These
curves are not periodic in any strict sense. 

Periodic Curves. 

The first thing to discuss is, how to disentangle the period 
from the symptom when a periodic curve is also symptomatic, 
or how to measure the period if the curve is not symptomatic. 
There is not space to discuss the matter completely, and I want 
rather to indicate the methods, and leave their consideration 

* See p. 81 . 
to the reader. A curve often suggests two things: first, that there
is a regular period, and, secondly, that there is a movement
apart from the period. Assume that we are dealing with
monthly observations and an annual period. To obtain the
movement apart from the period, take the averages of the
12 months of each year and mark them on the diagram; 
these points would show the average rate for the year, when 
the readings of the vertical scale have been adjusted. But 
there is something arbitrary in beginning the year at the 1st 
of January. The deaths, births, and marriages, and any other 
figures we deal with are probably independent of that 
particular beginning of the year, and if we make comparisons it 
may be better to take other periods to start with ; for instance, 
the fiscal year begins on April the 5th. We want a continuous 
representation, which we can obtain as follows : — First take 
the average from January 1st to December 31st. Then the 
average from the 1st of February to January 31st, and so on 
until we get 12 dots every year. It is clear that the curve 
through these points cannot have any sudden fluctuations; the 
curve so obtained shows the symptom when the period is 
eliminated. The theory underlying this method is quite 
simple. If we take any particular 12 months, we shall 
include the whole influence of the period, the excess in one 
part and the defect in another, and if we average 
them we shall probably get the number which would 
have occurred if there had been no period, and if 
the flow had been regular. It is approximate only, because 
the various small fluctuations will affect the average, and it
can be improved by smoothing the curve. If the series is not 
symptomatic the resulting smooth curve should be a horizontal 
straight line. 
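The elimination of the period by successive twelve-month averages can be sketched in Python. The monthly series below is invented for illustration (a steady rise of .1 a month with an annual wave upon it), and the function name `moving_average` is a hypothetical label.

```python
import math


def moving_average(values, window):
    """Average of every run of `window` consecutive values (January to December,
    then February to January, and so on)."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Invented monthly readings: a steady rise of 0.1 a month plus an annual wave.
series = [20 + 0.1 * t + 3 * math.sin(2 * math.pi * t / 12) for t in range(48)]

symptom = moving_average(series, 12)
steps = [later - earlier for earlier, later in zip(symptom, symptom[1:])]
print(round(min(steps), 6), round(max(steps), 6))
```

Each twelve-month average contains one reading from every part of the wave, so the wave cancels and only the steady rise of .1 a month is left between successive averages.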

Now, in order to measure the period as apart from the 
symptom, the only method is to write down the rates for the 
50 Januaries which we may be dealing with, and take the
arithmetical average, the mode, or the median of these; to 
repeat the process with the Februaries, and so on ; and then 
to represent the successive averages for the 12 months by a 
separate curve, which is best drawn with a base line through 
the general average of all the data. We thus get such a 
curve as that given by the graph of y = sin x, from 0° to 360°.
The justification of the method is simple. In the 50 Januaries 
we include one January from each part in the symptomatic 



curve. All the excesses due to the symptomatic tendency will
be counter-balanced by the defects, or will tend to be counter-balanced
by the defects, due also to symptomatic tendency.
They will only tend to be counter-balanced; for if we take
the 50 Januaries we include among them some extraordinary
months, and some months whose deviation from the annual
average is quite small. The accuracy with which we may 
expect to get the true January reading is proportional to the 
square root of the number of times taken, from the theory of 
averages discussed above. In carrying out the method, we 
implicitly assume that the causes which decide the symptom 
and the causes which decide the period are independent, 
while generally they are not independent. If there is an
increasing death rate or an increasing want of employment 
at the same time that the winter is especially severe the one 
will accentuate the other. It is very easy to see how the 
result may be affected. Suppose some industrial disaster 
throws a great proportion out of work in August in one year, 
so as to increase the percentage of unemployed, we will say 
to 50, then when taking the average for ten years, that 
figure alone gives a rate of 5 per cent, in August, whereas 
the excess had nothing to do with the fact that August was 
the month concerned. If you take a sufficient number of 
years, however, those things will tend to equalize one 
another, and if we use the median instead of the arithmetic 
average extraordinary occurrences have little effect. For 
this reason it is best to estimate the period from the medians. 
In the end we shall not get a smooth curve for our averages, 
and may have to smooth that by a trigonometrical function, 
or by some other method. 
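The effect of one extraordinary month on the average and on the median can be seen in a small Python sketch. The figures are invented: nine ordinary Augusts at 2 per cent unemployed are an assumption made for the illustration, and only the single disaster year at 50 per cent follows the case described above.

```python
import statistics

# Invented August percentages for ten years: nine ordinary years and one disaster year.
augusts = [2.0] * 9 + [50.0]

august_average = statistics.mean(augusts)
august_median = statistics.median(augusts)
print(august_average, august_median)  # 6.8 2.0
```

The disaster year adds nearly 5 per cent to the arithmetic average for August, while the median is unmoved.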

Symptomatic Series.

We will now discuss the symptomatic curves; the top 
curve in Diagram XIII (male death-rate) will do as well 
as any as an illustration, for the method of dealing with this 
curve applies to a very great number of such curves. All 
statistics representing sociological phenomena that I have 
had experience of are symptomatic. Perhaps in very rare 
cases you will find no symptom, but in general there is a 
symptom; however remotely connected the figures are with 
the general progress of civilization, you will find there is
some symptom up or down, or alternately up and down. In 


general we may assume a symptom in all figures relating to
human society. In dealing with such curves, we sometimes
want to examine them in detail for a short period; but more
often we are more concerned with the symptom, especially in
forecasting events. In curve A in Diagram XIII there are
considerable and rapid fluctuations, but there is also
optical evidence of a fall in the rate beginning between [...]
and 1870. The causes which produced the actual size of each
ordinate are, of course, very many, and it is impossible to
draw the line between those which tend to make a
permanent change, and those which tend to make a
temporary change. It is a question of degree and not of
character, and for that reason alone it is impossible to give
any theoretic solution for distinguishing the symptom from
the small fluctuations, just as it is impossible to give a
general solution to the interpolation problem. We have
to find an empirical solution, one that satisfies our immediate
needs. It might appear best to draw a straight line which
on the whole shall differ from the observations as little as
possible, and which could be determined by the method of
least squares; this would assume a symptomatic tendency of
equal increments or decrements in successive years. Or we
might assume a parabolic curve or logarithmic curve. A
recent American writer has assumed that a certain series can
be represented by $y = kc^x$, the compound interest law.
But I think in general there is no reason to assume any
definite algebraic law. The solution I should suggest, it is
a commonplace one, is similar to that I have just suggested for
the removal of the period. It is most easily understood by
an example.

The figures in the following table are from the Registrar-
General's Returns, or are calculated from the Statistical
Abstract.


    Imports per Head.         Marriage Rate per 1,000.     Death Rate of Males per 1,000.

   Amount    Deviation from      Rate    Deviation from        Rate    Deviation from
             Moving Average              Moving Average                Moving Average
     £             £
    3.30                         17.2                          21.7
    3.15                         17.2                          23.9
    3.21         - .13           15.8        - .7              25.5        +1.4
    2.91         - .44           15.9        - .6              23.8        - .3
    3.52         - .03           16.2        - .3              25.8        +1.9
    3.97         + .19           17.2        + .4              21.4        -2.0
    4.14         - .16           17.2          .0              22.8        - .6
    4.35         - .35           17.4          .0              23.2        + .1
    5.51         + .58           17.9        + .7              23.8        + .3
    5.51         + .17           17.2        + .1              24.4        +1.2
    5.16         - .64           16.2        - .7              23.5        + .4
    6.16         + .30           16.7        + .2              21.3        -1.8
    6.66         + .65           16.5          .0              22.6        - .3
    5.80         - .64           16.0        - .7              23.9        +1.3
    6.26         - .45           17.0        + .4              23.3        + .4
    7.32         + .40           17.1        + .6              22.1        - .8
    7.50         + .05           16.3        - .4              22.7
    7.72         - .33           16.1        - .6              22.4        - .8
    8.45         + .05           16.8          .0              24.1        + .4
    9.26         + .40           17.2        + .2              24.9        + .8
    9.06         - .06           17.5        + .4              24.5        + .3
    9.80         + .45           17.5        + .5              24.6        + .6
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-•3 
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-•4 

23-6 

*0 
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*0 
Take the last group of figures, the death rate of males.
I take the average of the first five death rates, 20.1 in 1845
to 24.4 in 1849, namely, 22.6, and place in the penultimate
column at the middle of the period, namely, the year 1847.
I begin again at the second year, 1846, and take the average
for 1846-1850, namely, 22.6 again, and place that at 1848, the
middle year of that period; and so on for 46 successive
periods. Then on the Diagram XIII, I have represented the
line of moving averages by the dotted line running through
the continuous line. I think it is clear that that line offers a
solution of the problem. In taking the average of five
years we are equally likely to include the ups and downs of
their fluctuations. If there was a regular period, and the
fluctuations were five-yearly we should remove them entirely
in five years, it would be the obvious time to take. If we
were dealing with figures referring to industry and the period
was ten years, ten years would be the most appropriate length
of time to average, including as it would one contribution from
each part of the fluctuation. If there is no regular period
there is no rule to be given as to what number of years we
shall take; it is a matter of convenience. If the five-yearly
average gives you a curve with sharp angles and apparently
random fluctuations, increase the number of years. It is
convenient to work with an odd number of years, as the
middle of the period then coincides with the middle of one of
the years; but, on the other hand, a period of ten years offers
arithmetical facilities. This method may, I think, be offered
for consideration; I believe it will be seen that it is a
solution of the problem. To complete it, I recommend
replacing the dotted line by a regular curve drawn very close
to it, smoothing out any little fluctuations which remain.
A curve thus drawn would fall from 1847 to [...],
rise for about seven years and then fall, fairly rapidly till
about 1882, and more slowly afterwards. In the nature of
things we cannot fix exact years for the end of the rise [...]
It is absolutely necessary to have some such method of
measuring the symptom before you can base any argument as
to the change in the quantity measured. That is very
important. For example, the curve D, which represents
imports, is a sharply fluctuating curve with a partial [...]
If, to take a particular date, we had in 1879 looked at the


England.

A. Deaths per 1,000 living: Males.
B. Deaths per 1,000 living: Females.
C. Persons married to 1,000 living.
D. Value of Imports per head of the population of the United Kingdom.
previous two years only we should have thought there was a
rapid fall in the average imports; but if we looked at the
history of the phenomenon we should have seen that this
appeared to be only part of a minor fluctuation, and
we should have seen that average imports had been stationary
on the whole for eight years.

It is not possible to say at the moment whether a change is of
a permanent nature or simply one of those little fluctuations
which characterize the phenomenon throughout the
century. For instance, by 1903 we can perhaps judge of the
tendency in 1900, but cannot judge of the current tendency,
because we have not enough information.

The deviations obtained by subtracting the instantaneous
average from the figures for each year are given in the last
column. The deviations for the first three groups of figures
in the table are calculated on a similar method. These
deviations should have some affinity to the curve of error.
Great deviations should be rare compared with small deviations,
and the occurrence of small and great deviations should bear
some such relation as the occurrence of great and small
deviations in the curve of error; but the agreement is not
likely to be close, for the deviations calculated here are not
independent one of the other; they are bound together by the
fact that the same number is used in forming five successive
averages, while the curve of error assumes that the things are
absolutely independent.
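The five-yearly moving average and the deviations from it can be sketched in Python. The function names are hypothetical; the figures are the first seven male death rates as they stand in the table, for which a full five-year period is available only at the third, fourth and fifth.

```python
def centred_moving_average(values, window=5):
    """Return {index: average} for every index with a full window centred on it."""
    half = window // 2
    return {i: sum(values[i - half:i + half + 1]) / window
            for i in range(half, len(values) - half)}


def deviations_from_moving_average(values, window=5):
    """Each figure minus the moving average placed at the same year."""
    averages = centred_moving_average(values, window)
    return {i: values[i] - average for i, average in averages.items()}


rates = [21.7, 23.9, 25.5, 23.8, 25.8, 21.4, 22.8]
for index, deviation in deviations_from_moving_average(rates).items():
    print(index, rates[index], round(deviation, 1))
```

Because each rate enters five successive averages, neighbouring deviations share most of their material, which is the want of independence referred to above.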

Correlation between Series.

That is a very rapid discussion of a rather wide subject,
but I must lead on to the correlation between two sets of
figures. If we were dealing with a curve with no symptom
and no period, for instance, two sets of figures relating to the
weather, $x_1, x_2, \ldots x_n$ representing the average temperature,
$y_1, y_2, \ldots y_n$ representing the average wind velocity, the
correlation between these two should be calculated as already
described. If we were dealing with a periodic curve, we
should replace the periodic curve by its line of averages
before comparing it with another curve. If there is an
irregular period, then I think we should proceed as if it were
a symptomatic curve with no period. Of course, any two
periodic curves with the same period are correlated. Any two
sequences of events which are influenced by the
changes in the weather will give a strong degree of correlation
quite independently of anything else. That is a quantity
which in general will not be worth measuring, but when we
come to very irregular periods such as those which we find in
trade statistics, it is worth measuring the correlation even
through the periods; because it is not so obvious, for instance,
that all the fluctuations of exports are correlated with all the
fluctuations of imports and that the two together are
correlated with the amount of employment.

A difficulty arises in dealing with many curves from the 
fact that the successive deviations year by year are not 
altogether independent. Many curves which deal with
sociological phenomena have fluctuations each of which
extends over several years, so that a rise in one year is more
often followed by a rise in the next than by a fall. Other
curves have the opposite character, that an excess in one year is
followed by defects in the other; for instance, if there is a
great death rate in one year we may expect a comparatively
small one in the following; and this absence of independence
should be kept in mind when we have to base arguments on
the resulting correlation. But apart from this, we could treat
the deviations from the moving line of averages as deviations
whose correlation we can fairly calculate.

There is a very great difficulty in working out the correlation
between symptomatic curves. If we do not take the
deviation from the line of averages, but take the deviations
from the average for the whole 50 years, any two symptomatic
curves will show correlation. If we take two things which
are absolutely disconnected, except that they are both
phenomena arising in the progress of society, and work out
the coefficient by the straightforward rule, we shall find there
is some correlation. If two curves have short fluctuations
which are correlated, but opposite symptoms, then owing to
the symptom apart from the fluctuations there would be
negative correlation, while owing to the fluctuations apart
from the symptom there would be positive correlation; and
when both are taken into account the correlation may be
positive, zero, or negative. It is therefore necessary to treat
the symptom separately from the short fluctuations. On the
whole there is not much benefit in measuring the correlation
coefficient for the symptoms; we should rather simply state that
the symptom is say 15° upward in one case and 10° downward
in the other. The useful measurement of the correlation
between two such curves is not that of the symptoms but of
the deviations.
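The opposition between the symptom and the short fluctuations can be shown with invented figures. In the Python sketch below (all names and numbers are hypothetical), two series share the same short fluctuation but have opposite symptoms; the coefficient is worked out first from the whole figures and then from the deviations from a five-yearly moving average.

```python
import math


def correlation(xs, ys):
    """r = sum(x*y) / (n * sigma1 * sigma2), x and y measured from their averages."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sigma1 = math.sqrt(sum(d * d for d in dx) / n)
    sigma2 = math.sqrt(sum(d * d for d in dy) / n)
    return sum(a * b for a, b in zip(dx, dy)) / (n * sigma1 * sigma2)


def deviations_from_moving_average(values, window=5):
    """Figures minus the centred moving average, for years with a full window."""
    half = window // 2
    return [values[i] - sum(values[i - half:i + half + 1]) / window
            for i in range(half, len(values) - half)]


years = range(30)
wave = [1.0 if t % 2 == 0 else -1.0 for t in years]     # shared short fluctuation
rising = [0.5 * t + w for t, w in zip(years, wave)]     # symptom upward
falling = [-0.5 * t + w for t, w in zip(years, wave)]   # symptom downward

r_whole = correlation(rising, falling)
r_deviations = correlation(deviations_from_moving_average(rising),
                           deviations_from_moving_average(falling))
print(round(r_whole, 2), round(r_deviations, 2))
```

From the whole figures the opposite symptoms give a strongly negative coefficient; from the deviations the shared fluctuation gives a coefficient close to +1.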

Another question which arises very often in a practical 
way is, whether we should compare the deviation of the 
whole figures, say imports, with the deviation for the […], 
say the marriage rate, in the same year, or in the next. 
Can we correlate the imports of 1847 with the marriage rate 
of 1847, or should it be taken in comparison with that of 
1848? That question will often occur, especially with 
marriage and birth rates. Mr. Hooker has suggested,* a 
suggestion which has been made independently in A[…], 
that we should work out the coefficients of correlation on 
the hypothesis of synchronism, and on alternate hypotheses 
that one event follows half or one year after the other, and 
see which correlation is the greatest. In this way we should 
get a series of correlation coefficients according to the 
dates we take. 
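The comparison of synchronous and lagged correlations may be sketched in modern terms as follows. This is only an illustration under stated assumptions: the function names and the figures are invented, the series are taken to be deviations already measured from their moving averages, and lags are counted in whole years (a half-year lag would need half-yearly figures).

```python
from math import sqrt

def correlation(xs, ys):
    """Product-moment coefficient of two equal-length lists of deviations.

    Assumes the inputs are already deviations (e.g. from a moving average),
    so no average is subtracted here.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return sxy / sqrt(sxx * syy)

def lagged_correlations(xs, ys, max_lag):
    """Correlate xs in year t with ys in year t + lag, for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        n = len(xs) - lag  # number of pairs still available at this lag
        result[lag] = correlation(xs[:n], ys[lag:lag + n])
    return result

# Invented deviations (hypothetical data): ys repeats xs one year later.
xs = [1.0, -2.0, 3.0, -1.0, 2.0, -3.0, 1.0, 0.5]
ys = [0.0] + xs[:-1]

by_lag = lagged_correlations(xs, ys, max_lag=2)
best_lag = max(by_lag, key=by_lag.get)  # the lag giving the greatest coefficient
```

Where negative correlation is expected, it would be more natural to compare the coefficients by absolute value; the sketch simply takes the greatest, as the text describes.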

Before we proceed to measure correlation by mathematical 
formulae we should observe it purely graphically; the 
graphic representation of series will often suggest the 
existence of correlation, which can then be measured by the 
mathematical formula. The curves A and B in Diagram XIII 
are obviously closely correlated. In the curves B and […] we 
cannot decide from the figure whether there is correlation or 
not; at any rate the evidence of correlation is not so […]. 
In the curves B and D, I do not think we could decide from 
the figure as drawn, though we might perhaps from one 
drawn in a different way, whether there was correlation or 
not. 

Let us proceed to discuss how to put two curves down so 
as to get optical evidence as to whether they are correlated 
or not. Instead of measuring figures as in Diagram XIII, we 
measure as in Diagram XIV. Plot out the deviations 
calculated on p. 79 above and below a base line 
representing zero; but before doing so it is necessary to adjust 
the relative scales of the two quantities so as to bear a 
definite relation the one to the other. There is no […] 
way of comparing pounds sterling with one per thousand in 
the marriage rate. The way which naturally suggests itself, 

* Journal of the Royal Statistical Society, Sept. 1901. See […] 
pp. 490-1. I think that Mr. Hooker was also the first to publish a coefficient 
of correlation based on deviations from a moving average. 
and it is very useful for making the optical evidence of 
correlation vivid, is to represent the standard deviation for 
each group by unity on the vertical scale. The standard 
deviation for the death rate of males is .830; of females it 
is .803; so we represent the deviation .830 for males and 
.803 for females by the same vertical line. If we were 
doing the same thing for imports and male death rates, we 

should represent by the same vertical scale .886 of a pound 
and .83 death rate. This method has been applied in 
Diagram XIV, for the comparison of lines A and B of 
Diagram XIII. I have put in the death rate for males 
represented by the zigzag line, but I found I could not put in 
the death rate for females in the same way and make the 
lines distinct, and therefore have drawn short horizontal 
lines to show the death rate for females year by year. The 
dots and lines representing the two series are in nearly every 
year close together. The optical evidence of correlation is 
very great indeed. 

The illustration taken is of two series where the correlation 
is nearly perfect; in less perfect cases we can get 
evidence by noticing whether the maxima and minima occur 
at the same dates in the two series. For example, if we take 
the value of exports and percentage of unemployed, we 
should find perhaps that the maximum of the one came at the 
same time as the minimum of the other throughout, and that 
would give strong optical evidence of negative correlation. 
A method of testing whether there was correlation or not, 
which would naturally suggest itself to anyone who had a 
small knowledge of probability, would be to see how often a 
positive deviation of the one agreed with a positive deviation 
of the other; how often like signs concurred and how often 
unlike signs concurred. If we wrote down 50 + and - signs 
at random and another 50 alongside, the chances of getting 
various numbers of agreements are easily calculated, and are 
in fact the successive terms in the expansion of $(\tfrac{1}{2} + \tfrac{1}{2})^{50}$. But 
that is not a good method, for it does not take into account 
one of the most important considerations, whether a great 
fluctuation of the one corresponds with a great fluctuation of 
the other or not. We should get equal evidence of correlation 
by this method when we had a resemblance of this sort: 
two curves where great fluctuations corresponded with great 
and small with small throughout, and when the correspondence 
was in sign only. Those things are obviously not of the 
same importance, and so the method of merely counting signs 
will not take us very far. 
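The sign-counting method and its weakness may be illustrated by a short sketch. The function names and the two small series are invented for the purpose; a zero deviation is arbitrarily counted as negative, which is an assumption of the sketch and not a rule from the lecture.

```python
from math import comb

def agreement_probabilities(n):
    """Chance of exactly k agreements in sign, k = 0..n, for two random series.

    These are the successive terms of the expansion of (1/2 + 1/2)**n.
    """
    return [comb(n, k) / 2 ** n for k in range(n + 1)]

def count_agreements(xs, ys):
    """Number of years in which the two deviations have the same sign.

    A zero deviation is treated as negative (an arbitrary choice).
    """
    return sum(1 for x, y in zip(xs, ys) if (x > 0) == (y > 0))

probs = agreement_probabilities(50)
# Chance of 35 or more agreements out of 50 arising purely at random.
tail = sum(probs[35:])

# Invented series: b_close follows a in size as well as in sign,
# while b_signs agrees with a in sign only.
a = [5.0, -4.0, 3.0, -6.0]
b_close = [4.8, -4.1, 3.2, -5.9]
b_signs = [0.1, -0.1, 0.1, -0.1]

# The count of agreements cannot tell the two cases apart.
same_count = count_agreements(a, b_close) == count_agreements(a, b_signs)
```

Both pairs give the same count of agreements, although only the first shows great fluctuations answering to great; this is the defect of the method that the text points out.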

We will, then, proceed by the method of calculating 
correlation described above. Referring now to the table on 
p. 79, it should be remarked that the method of evaluating 
the value of imports changes at the year 1852; I have had to 
approximate before that year from the values of the exports, 
because the figures given by the Board of Trade before and 
after that date are calculated on different methods, and are 
not comparable. Otherwise the figures of total value of 
imports for home consumption to the United Kingdom are 
comparable. To my mind the imports are more significant 
than the exports; and also it seems to me absurd to add 
imports and exports; I do not think you can add them 
together any more than you can add bread to butter. I have 
taken the imports only, and, without criticising the figures in 
detail, I have divided by the gross population as given in the 
Statistical Abstract. Thus we get the amount per head 
given in the first column in the table. I have only intended 
to work to the second place of decimals. The death and 
marriage rates are taken from the Registrar-General's Report 
for 1895, which gives the figures for the previous 50 years. 
The standard deviations given at the bottom of the table are 
obtained by taking the square root of the sum of the squares 
of the 46 deviations given, divided by 46, as in the ordinary 
formula for standard deviation. The standard deviation is 
essentially an absolute quantity without sign. 
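The rule just stated for the standard deviation can be written out as a brief sketch. The deviations below are invented and are not those of the table on p. 79; the function name is likewise only illustrative. The last line shows the rescaling used for the diagram, in which the standard deviation of each group is represented by unity.

```python
from math import sqrt

def standard_deviation(deviations):
    """Square root of the sum of squared deviations divided by their number."""
    n = len(deviations)
    return sqrt(sum(d * d for d in deviations) / n)

# Invented deviations from a moving average (hypothetical data).
devs = [0.3, -0.4, 0.0, 0.5, -0.2, -0.2]

sd = standard_deviation(devs)  # always positive: a quantity without sign

# Rescale so that one standard deviation is unity on the vertical scale;
# two series so treated can be plotted against the same base line.
scaled = [d / sd for d in devs]
```

Note that the divisor is the number of deviations itself (46 in the lecture's table), as the text specifies, and not one less than that number.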

I have calculated the coefficients of correlation between 
groups 1 and 2 (imports and marriage rate), between groups 
1 and 3 (imports and death rates), between 2 and 3 (marriage 
and death rates), and between 3 and 4 (death rates for males 
and females). I have intended to choose cases where, a priori, 
we might expect small correlation, no correlation, and great 
correlation. A priori we should expect correlation in the 
positive sense between imports and the marriage rate; not 
that increased imports cause an increase of the marriage rate, 
but the causes which produce prosperity are likely to have 
effect in increasing both imports and the marriage rate; the 
complexus of causes which decide the two things have 
something in common. The coefficient is .65. The marriage 
rate and death rate have presumably very little in common. 
One certainly could not say to start with whether an 
increasing death rate would synchronize with an increasing 
or with a diminishing marriage rate. The correlation between 
the two is -.19. The correlation between the imports and 
the death rate is -.22. The correlation between the death 
rate for males and that for females is +.99; it is practically 1, 


but the number 1 can only be obtained if there is an 
absolute proportion all through the scale, which there is not in 
this case. 

CRITERION OF THE CORRELATION COEFFICIENT. 

Now we are face to face with the question, What do these 
numerical values mean, and which of them are 
significant? It is clear that some such question arises, 
because if we write down two series absolutely at random 
and work out their formulae the chances are very much 
against your obtaining zero, and there are heavy odds against 
obtaining a small number. Now the chance of obtaining a 
coefficient near zero increases with the number of terms. If 
we have two series, $u_1, u_2, \ldots, u_n$ and $v_1, v_2, \ldots, v_n$, measured 
from their averages, and we select a group of $v$'s which are 
near to one another, the $u$'s which will be their factors in 
forming the sum of the products are equally likely to be 
positive or negative; if we had an infinite number of these 
deviations their sum will be nothing; and the sum would 
tend to zero if we increased the number of terms, the actual 
deviation from zero being in inverse proportion to the square 
root of $n$, the number of terms. Hence the number of terms 
taken has much to do with the significance of the resulting 
coefficient of correlation, and we should expect that the 
quantity $1/\sqrt{n}$ would enter into the measurement of the 
significance of the coefficient of correlation. It is a little 
difficult to state and explain the measurement of the criterion 
of the significance; but it is absolutely necessary to make the 
attempt. Of the coefficients just given, the first and fourth 
are found to be significant, and the second and third not, 
when tested by the theoretical criterion. 
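The statement that chance coefficients shrink in inverse proportion to the square root of the number of terms can be checked roughly by experiment. The following is a sketch only: it assumes that the random series are drawn from a normal curve of error, and the function names, the seed, and the numbers of terms and trials are all arbitrary choices made for the illustration.

```python
import random
from math import sqrt

def correlation(xs, ys):
    """Product-moment coefficient, each series measured from its own average."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def typical_chance_correlation(n, trials, rng):
    """Root-mean-square coefficient between pairs of unconnected random series."""
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += correlation(xs, ys) ** 2
    return sqrt(total / trials)

rng = random.Random(0)  # fixed seed so that the run is repeatable
rms_25 = typical_chance_correlation(25, 2000, rng)
rms_100 = typical_chance_correlation(100, 2000, rng)

# Quadrupling the number of terms should roughly halve the chance coefficient.
ratio = rms_25 / rms_100
```

If the argument of the text holds, `ratio` should come out in the neighbourhood of 2, since the numbers of terms stand as 1 to 4.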

Suppose we take two correlated groups, and that there is, 
as a matter of fact, a definite value for the coefficient of 
correlation; and then suppose we take 50 samples from each, 
that is to say, 50 pairs of events, we shall not naturally obtain 
exactly the coefficient of correlation that belongs to 
the whole groups. The chances are against obtaining 
exactly that result. Now, the deviations from the actual 
coefficient of correlation which are obtained by taking samples 



and finding the correlation have a curve of frequency 
$y = \frac{1}{c\sqrt{\pi}}\, e^{-x^2/c^2}$, where $y$ is the probability that the coefficient 
obtained differs by $x$ from the true coefficient,* and $c$, the 
modulus, $= (1 - r^2)\sqrt{2}/\sqrt{n}$, where $r$ is the result obtained from 
the sample group, which consists of $n$ pairs. The probable 
error in this curve of error is .67 of $(1 - r^2)/\sqrt{n}$. For example, in 
the coefficient between imports and the marriage rate, $n = 46$, 
the calculated coefficient of correlation is .65, and the probable 
error for its curve of frequency is $.67 \times (1 - r^2)/\sqrt{46} = .056$. 

That is to say, from the calculation itself it is as likely as 
not that the actual coefficient is between .65 + .056 and 
.65 - .056. The chance of the true coefficient being as much 
as the modulus, namely .115, distant from the calculated .65 
is shown by the table of the error function to be only 16 in 
100; the chance of it being so far from .65 as to be actually 
zero is infinitesimal, for in the curve of error the cases 
where the deviation is as much as six times the modulus are 
practically non-existent. So that we have overwhelming 
evidence, if our general principle of calculation is correct, of 
correlation between the first and the second columns, and the 
most probable value of that correlation is two-thirds. In 
other words, the standard deviation of imports being £.387, 
and the standard deviation of the marriage rate .37, 
the most probable deviation of the marriage rate is 
two-thirds of .37 = .24, when we find a deviation in imports of 
+ £.387, and so on in proportion. This statement should be 
connected with the graphic measurement of correlation 
discussed on p. 70. In the second case of correlation, that 
between imports and the male death rate, where the 
coefficient is -.22, the probable error by the method just 
described is .09. That is to say, our calculation means that it 
is as likely as not, from our evidence, that the correlation 
between these two series is between -.13 and -.31; the 
chance that the real correlation is zero or positive is quite 
perceptible. The chance from the table of the error 

* See Pearson, in Royal Soc. Trans., A. 175, p. 265; and correction, in 
Royal Soc. Proceedings, Oct. 18th, 1897; also Yule, in Statistical Journal, 
1897, p. 847. 
function that a negative deviation as great as .22 should 
occur when the probable error is .09 is about one in ten. If 
we took ten groups which had zero, or slight positive 
correlation, in one of these groups you might expect to get 
as much as -.22. Similarly, the chance that uncorrelated 
groups of 46 pairs should give the coefficient found 
between marriage and male death rate, namely, -.19, is one 
in six: that is to say, once in six groups which were not 
connected you would obtain that apparent correlation. The 
chance that you obtain the coefficient of correlation .99 from 
a random group is practically zero. That is to say, there is 
correlation between male and female death rates, and it is of 
such a nature that you could, given the deviation of death 
rate of males in the year, write down with very fair certainty 
the average death rate of females. For example, given that the 
death rate of males was +.5 in excess of the moving average, 
then the most probable death rate of females would be 
$.99 \times \frac{.803}{.830} \times .5$, or .48 in excess of the average, and it is unlikely 
that any rate differing at all far from this will occur. 
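The arithmetic of the probable error and of the most probable deviation may be retraced by a short sketch. The function names are invented for the illustration; the factor .67 and the figures are those quoted in the lecture. With r = .65 and 46 pairs the formula as written gives about .057, very near the .056 printed above; the small difference presumably reflects rounding in the original working.

```python
from math import sqrt

def probable_error(r, n):
    """Probable error of a coefficient r found from n pairs: .67 (1 - r^2) / sqrt(n)."""
    return 0.67 * (1 - r * r) / sqrt(n)

def most_probable_deviation(r, sd_x, sd_y, dev_x):
    """Most probable deviation of y when x is observed to deviate by dev_x."""
    return r * (sd_y / sd_x) * dev_x

# Imports and marriage rate: r = .65 from 46 pairs.
pe_marriage = probable_error(0.65, 46)

# Imports and male death rate: r = -.22 from 46 pairs.
pe_death = probable_error(-0.22, 46)

# Male death rate .5 above its moving average; standard deviations
# .830 (males) and .803 (females); r = .99.
female_excess = most_probable_deviation(0.99, 0.830, 0.803, 0.5)
```

A coefficient several times its probable error, as in the first case, may be taken as significant; one of only two or three times its probable error, as in the second, leaves the question open.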

We have thus found a way of measuring correlation, and 
of testing the significance of our measurement, between two 
groups and between two series. The method must be used 
with discretion. There is no time to discuss under what 
circumstances it is applicable, nor the further developments 
of the theory. 


CONCLUSION. 

In these lectures I have tried to indicate the common-sense 
treatment of curve drawing and averages on the one hand, and 
the more delicate and exact method of representing groups 
and series by quantities based upon algebraic work on the 
other. Directly we attempt to use the latter methods, the 
algebraic methods, we find that we are bound to make 
approximations that involve the use of the theory of 
probability and the theory of error, and I have therefore 
been compelled to deal with these theories. When I have been 
treating them I have not attempted to promulgate any 
original opinions, I have only tried to illustrate principles, 
which are already laid down, by new examples. But since 
the modern shape of the theories of probability and error is 
new, and involves some matters which are still controversial 
(so far as mathematical reasoning can be controversial), I 
have found it necessary to spend some little time in examining 
the foundations of the theories in some detail. I have only 
been able to deal with the beginnings of some of the difficult 
questions which arise, and I am sorry that for want of time I 
have been compelled to leave out many illustrations of the 
practical utility of the methods; I have had to spend time on 
the theory rather than on the practice. My object will have 
been completely attained if I have succeeded in indicating 
the scope and the interest of the application of the theory of 
error, a subject which urgently needs the co-operation of 
serious students, alike to calculate experimental data, which 
are very much wanting, and to criticize, establish, and enlarge 
the body of theory. 



